RETURN-BASED CONTRASTIVE REPRESENTATION LEARNING FOR REINFORCEMENT LEARNING

Abstract

Recently, various auxiliary tasks have been proposed to accelerate representation learning and improve sample efficiency in deep reinforcement learning (RL). However, existing auxiliary tasks are unsupervised and do not take the characteristics of RL problems into consideration. By leveraging returns, the most important feedback signal in RL, we propose a novel auxiliary task that forces the learnt representations to discriminate between state-action pairs with different returns. Our auxiliary loss is theoretically justified to learn representations that capture the structure of a new form of state-action abstraction, under which state-action pairs with similar return distributions are aggregated together. In the low-data regime, our algorithm outperforms strong baselines on complex tasks in Atari games and the DeepMind Control suite, and achieves even better performance when combined with existing auxiliary tasks.

1. INTRODUCTION

Deep reinforcement learning (RL) algorithms can simultaneously learn representations from high-dimensional inputs and learn policies based on those representations to maximize long-term returns. However, deep RL algorithms typically require large numbers of samples, which can be quite expensive to obtain (Mnih et al., 2015). In contrast, learning policies on top of learned representations or extracted features is usually much more sample efficient (Srinivas et al., 2020). To this end, various auxiliary tasks have been proposed to accelerate representation learning in aid of the main RL task (Suddarth and Kergosien, 1990; Sutton et al., 2011; Gelada et al., 2019; Bellemare et al., 2019; François-Lavet et al., 2019; Shen et al., 2020; Zhang et al., 2020; Dabney et al., 2020; Srinivas et al., 2020). Representative examples of auxiliary tasks include predicting the future in either the pixel space or the latent space with reconstruction-based losses (e.g., Jaderberg et al., 2016; Hafner et al., 2019a;b). Recently, contrastive learning has been introduced to construct auxiliary tasks and achieves better performance than reconstruction-based methods in accelerating RL algorithms (Oord et al., 2018; Srinivas et al., 2020). Without the need to reconstruct inputs such as raw pixels, contrastive learning based methods can ignore irrelevant features, such as static backgrounds in games, and learn more compact representations. Oord et al. (2018) propose a contrastive representation learning method based on the temporal structure of state sequences. Srinivas et al. (2020) propose to leverage prior knowledge from computer vision, learning representations that are invariant to image augmentation. However, existing works mainly construct contrastive auxiliary losses in an unsupervised manner, without using the feedback signals of RL problems as supervision.
In this paper, we take a further step and leverage return feedback to design a contrastive auxiliary loss that accelerates RL algorithms. Specifically, we propose a novel method called Return-based Contrastive representation learning for Reinforcement Learning (RCRL). In our method, given an anchor state-action pair, we choose a state-action pair with the same or a similar return as the positive sample, and a state-action pair with a different return as the negative sample. We then train a discriminator to classify positive and negative samples given the anchor, based on their representations, as the auxiliary task. The intuition is to learn state-action representations that capture return-relevant features while ignoring return-irrelevant ones. From a theoretical perspective, RCRL is supported by a novel state-action abstraction, called Z^π-irrelevance. Z^π-irrelevance abstraction aggregates state-action pairs with similar return distributions under a certain policy π. We show that Z^π-irrelevance abstraction can reduce the size of the state-action space (cf. Appendix A) as well as approximate the Q values arbitrarily accurately (cf. Section 4.1). We further propose a method called Z-learning that computes the Z^π-irrelevance abstraction from sampled returns rather than the return distribution, which is hardly available in practice. Z-learning learns the Z^π-irrelevance abstraction provably efficiently. Our algorithm RCRL can be seen as the empirical version of Z-learning that makes a few approximations, such as integrating with deep RL algorithms and sampling positive pairs from a consecutive segment of the anchor's trajectory. We conduct experiments on Atari games (Bellemare et al., 2013) and the DeepMind Control suite (Tassa et al., 2018) in the low-data regime.
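The anchor/positive/negative construction described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the linear `embed` map stands in for a deep encoder, returns and features are synthetic, and the positive/negative selection by nearest/farthest return is a simplification of the segment-based sampling described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(sa, W):
    # Toy linear embedding; a stand-in for the deep encoder.
    return W @ sa

def rcrl_loss(anchor, positive, negative, W):
    # The discriminator scores a pair by the dot product of embeddings and
    # is trained to output 1 for the positive (return similar to the
    # anchor's) and 0 for the negative (dissimilar return).
    za, zp, zn = embed(anchor, W), embed(positive, W), embed(negative, W)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p_pos, p_neg = sigmoid(za @ zp), sigmoid(za @ zn)
    return -np.log(p_pos + 1e-8) - np.log(1.0 - p_neg + 1e-8)

# Hypothetical batch: 10 state-action feature vectors with sampled returns.
features = rng.normal(size=(10, 4))
returns = rng.normal(size=10)
W = 0.1 * rng.normal(size=(3, 4))

# For anchor 0: positive = closest return, negative = farthest return.
a = 0
others = [i for i in range(10) if i != a]
pos = min(others, key=lambda i: abs(returns[i] - returns[a]))
neg = max(others, key=lambda i: abs(returns[i] - returns[a]))
loss = rcrl_loss(features[a], features[pos], features[neg], W)
```

Minimizing this binary classification loss by gradient descent on `W` (jointly with the main RL loss, in the full method) pushes embeddings of similar-return pairs together and dissimilar-return pairs apart.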
The experiment results show that our auxiliary task, combined with Rainbow (Hessel et al., 2017) for discrete control tasks or SAC (Haarnoja et al., 2018) for continuous control tasks, achieves superior performance over other state-of-the-art baselines in this regime. Our method can be further combined with existing unsupervised contrastive learning methods to achieve even better performance. We also perform a detailed analysis of how the representation changes during training with and without our auxiliary loss. We find that a good embedding network assigns similar/dissimilar representations to state-action pairs with similar/dissimilar return distributions, and that our algorithm can boost such generalization and speed up training. Our contributions are summarized as follows:
• We introduce a novel return-based contrastive loss to learn return-relevant representations and speed up deep RL algorithms.
• We theoretically connect the contrastive loss to a new form of state-action abstraction, which can reduce the size of the state-action space as well as approximate the Q values arbitrarily accurately.
• Our algorithm achieves superior performance against strong baselines on Atari games and the DeepMind Control suite in the low-data regime. Moreover, its performance can be further enhanced when combined with existing auxiliary tasks.
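The state-action abstraction mentioned in the contributions can be illustrated with a toy sketch. The helper below is ours, not the paper's: it greedily aggregates state-action pairs whose sampled scalar returns differ by at most a tolerance `eps`, whereas the paper's Z^π-irrelevance is defined over full return distributions under a policy π.

```python
def z_irrelevance_partition(returns, eps):
    # Greedy sketch: each state-action pair (indexed by position) joins the
    # first abstract class whose representative return is within eps of its
    # own sampled return; otherwise it starts a new class.
    clusters = []  # each entry: (representative_return, member_indices)
    for i, g in enumerate(returns):
        for rep, members in clusters:
            if abs(g - rep) <= eps:
                members.append(i)
                break
        else:
            clusters.append((g, [i]))
    return [members for _, members in clusters]

# Six hypothetical state-action pairs with sampled returns, eps = 0.5:
classes = z_irrelevance_partition([1.0, 1.2, 3.0, 0.9, 3.4, 5.0], 0.5)
# -> [[0, 1, 3], [2, 4], [5]]
```

Pairs 0, 1, and 3 (returns near 1) collapse into one abstract state-action, shrinking the effective state-action space while keeping return-relevant distinctions, which is the structure the contrastive loss is meant to capture.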

2. RELATED WORK

2.1. AUXILIARY TASK

In reinforcement learning, auxiliary tasks can be used in both the model-based and the model-free setting. In the model-based setting, world models can serve as auxiliary tasks and lead to better performance, as in CRAR (François-Lavet et al., 2019), Dreamer (Hafner et al., 2019a), and PlaNet (Hafner et al., 2019b). Due to the complex components (e.g., the latent transition or reward module) in the world model, such methods are empirically unstable to train and rely on various regularizations to converge. In the model-free setting, many algorithms construct auxiliary tasks to improve performance, such as predicting the future (Jaderberg et al., 2016; Shelhamer et al., 2016; Guo et al., 2020; Lee et al., 2020; Mazoure et al., 2020), learning value functions with different rewards or under different policies (Veeriah et al., 2019; Schaul et al., 2015; Borsa et al., 2018; Bellemare et al., 2019; Dabney et al., 2020), learning from many goals (Veeriah et al., 2018), or combining different auxiliary objectives (de Bruin et al., 2018). Moreover, auxiliary tasks can be designed based on prior knowledge about the environment (Mirowski et al., 2016; Shen et al., 2020; van der Pol et al., 2020) or the raw state representation (Srinivas et al., 2020). Hessel et al. (2019) also apply auxiliary tasks to the multi-task RL setting.



* This work was conducted at Microsoft Research Asia. The first two authors contributed equally to this work.

