LIGHT-WEIGHT PROBING OF UNSUPERVISED REPRESENTATIONS FOR REINFORCEMENT LEARNING

Abstract

Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is computationally intensive and has high-variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with actual downstream control performance on the Atari100k Benchmark. This provides a better method for exploring the space of pretraining algorithms without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective. Code will be released upon acceptance.

1. INTRODUCTION

Learning visual representations is a critical step towards solving many kinds of tasks, from supervised tasks such as image classification or object detection to reinforcement learning (RL). Ever since the early successes of deep reinforcement learning (Mnih et al., 2015), neural networks have been widely adopted to solve pixel-based reinforcement learning tasks such as arcade games (Bellemare et al., 2013), physical continuous control (Todorov et al., 2012; Tassa et al., 2018), and complex video games (Synnaeve et al., 2018; Oh et al., 2016). However, learning deep representations directly from rewards is challenging, since this learning signal is often noisy, sparse, and delayed. With ongoing progress in unsupervised visual representation learning for vision tasks (Zbontar et al., 2021; Chen et al., 2020a;b; Grill et al., 2020; Caron et al., 2020; 2021), recent efforts have likewise applied self-supervised techniques and ideas to improve representation learning for RL. Promising approaches include supplementing the RL loss with self-supervised objectives (Laskin et al., 2020; Schwarzer et al., 2021a), or first pre-training the representations on a corpus of trajectories (Schwarzer et al., 2021b; Stooke et al., 2021). However, the diversity of the settings considered, as well as of the self-supervised methods used, makes it difficult to identify the core principles of successful self-supervised methods in RL. Moreover, estimating the performance of RL algorithms is notoriously challenging (Henderson et al., 2018; Agarwal et al., 2021): it often requires repeating the same experiment with many random seeds, and most online RL methods require a high CPU-to-GPU ratio, which is inefficient for typical research compute clusters. This hinders systematic exploration of the many design choices that characterize SSL methods.
In this paper, we strive to provide a reliable and lightweight evaluation scheme for unsupervised visual representations in the context of RL. Inspired by the vision community, we propose to evaluate the representations using linear probing, by training a linear prediction head on top of frozen features. We devise two probing tasks that we deem widely applicable: predicting the reward in a given state, and predicting the action that would be taken by a fixed policy in a given state (for example, that of an expert). We stress that these probing tasks are only used as a means of evaluation. Because very little supervised data is required, they are particularly suitable for situations where obtaining the expert trajectories or reward labels is expensive. Through thorough experimentation, we show that the performance of the SSL algorithms (in terms of their downstream RL outcomes) correlates with the performance on both probing tasks with statistically significant (p<0.001) Spearman's rank correlation, making them particularly effective proxies.

Figure 1: Left: Correlation between the SSL representations' ability to linearly predict the presence of reward in a given state and RL performance using the same representations, measured as the interquartile mean of the human-normalized score (HNS) over 9 Atari games. Each point denotes a separate SSL pretraining method. A line of best fit is shown with a 95% confidence interval. We compute Spearman's rank correlation coefficient (Spearman's r) and determine its statistical significance using permutation testing (with n = 50000). Right: When comparing two models, the reward probing score can give low-variance, reliable estimates of RL performance, while direct RL evaluation may require many seeds to reach meaningful differences in mean performance.
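The probing protocol described above can be sketched in a few lines: fit a linear classifier on top of frozen encoder features, then report held-out accuracy. The sketch below is a minimal numpy illustration, not the paper's implementation; the feature matrix stands in for frozen encoder outputs, and the binary labels (e.g., "reward present in this state") are synthetic.

```python
import numpy as np

def train_linear_probe(feats, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on frozen features (encoder is never updated)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))               # sigmoid predictions
        w -= lr * X.T @ (p - labels) / len(X)          # gradient of binary cross-entropy
    return w

def probe_accuracy(w, feats, labels):
    """Held-out accuracy of the linear probe."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return float(((X @ w > 0).astype(int) == labels).mean())

# Synthetic stand-in data: in the actual protocol, `feats` would be the frozen
# SSL encoder's outputs on states, and `labels` the reward/action annotations.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 32))
labels = (feats[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)

w = train_linear_probe(feats[:1600], labels[:1600])
acc = probe_accuracy(w, feats[1600:], labels[1600:])
```

Because only a linear head is trained, a probe run takes seconds on CPU, which is what makes sweeping many pretraining configurations cheap compared to full RL evaluation.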
Given the vastly reduced computational burden of linear evaluations, we argue that they enable much easier and more straightforward experimentation with SSL design choices, paving the way for a more systematic exploration of the design space. Finally, we leverage this framework to systematically assess some key attributes of SSL methods. First, we explore the utility and role of learning a forward model as part of the self-supervised objective. We investigate whether its expressiveness matters and show that equipping it with the ability to model uncertainty (through random latent variables) significantly improves the quality of the representations. Next, we identify several knobs in the self-supervised objective, allowing us to carefully tune the parameters in a principled way. Finally, we confirm the previous finding (Schwarzer et al., 2021b) that bigger architectures, when adequately pre-trained, tend to perform better. Our contributions can be summarized as follows:
• Design of a rigorous and efficient SSL evaluation protocol in the context of RL.
• Empirical demonstration that this evaluation scheme correlates with downstream RL performance.
• Systematic exploration of design choices in existing SSL methods.
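The correlation claim in the second contribution rests on Spearman's rank correlation with a permutation test, as in Figure 1. The following is a minimal numpy sketch of that statistic; the probe-score and RL-score arrays are hypothetical placeholder numbers, not results from the paper, and the demo uses fewer permutations than the paper's n = 50000 to keep it fast.

```python
import numpy as np

def spearman_r(x, y):
    """Spearman's rank correlation (assumes no ties, so double-argsort gives ranks)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def permutation_test(x, y, n_perm=5000, seed=0):
    """One-sided p-value: fraction of shuffled pairings with r >= observed."""
    rng = np.random.default_rng(seed)
    observed = spearman_r(x, y)
    count = sum(spearman_r(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return observed, (count + 1) / (n_perm + 1)

# Hypothetical placeholder scores: one entry per SSL pretraining method.
probe_scores = np.array([0.51, 0.55, 0.58, 0.62, 0.66, 0.70,
                         0.73, 0.77, 0.81, 0.84, 0.88, 0.93])
rl_iqm_hns   = np.array([0.10, 0.12, 0.15, 0.19, 0.22, 0.28,
                         0.31, 0.37, 0.44, 0.52, 0.61, 0.75])

r, p = permutation_test(probe_scores, rl_iqm_hns)
```

A rank-based statistic is a natural choice here: it only asks whether the probe ranks methods in the same order as full RL evaluation, without assuming the relationship between probe score and HNS is linear.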

2. RELATED WORK

2.1. REPRESENTATION LEARNING

There has recently been a surge of interest and advances in self-supervised learning for computer vision. State-of-the-art techniques include contrastive learning methods SimCLR and MoCov2 (Chen et al., 2020a;b); clustering methods such as SwAV (Caron et al., 2020); distillation methods BYOL, SimSiam, and OBoW (Grill et al., 2020; Chen and He, 2021; Gidaris et al., 2020); and information-maximization methods Barlow Twins and VicReg (Zbontar et al., 2021; Bardes et al., 2021). These advances have likewise stimulated development in representation learning for reinforcement learning. One line of work includes unsupervised losses as an auxiliary objective during RL training to improve data efficiency. Such objectives can be contrastive (Laskin et al., 2020; Zhu et al., 2020)

