ON THE DATA-EFFICIENCY WITH CONTRASTIVE IMAGE TRANSFORMATION IN REINFORCEMENT LEARNING

Abstract

Data-efficiency has always been an essential issue in pixel-based reinforcement learning (RL), since the agent must learn not only decision-making but also meaningful representations from images. The line of work on reinforcement learning with data augmentation shows significant improvements in sample-efficiency. However, it is challenging to guarantee an optimality-invariant transformation; that is, the augmented data may be recognized as a completely different state by the agent. To this end, we propose contrastive invariant transformation (CoIT), a simple yet promising learnable data augmentation that combines with standard model-free algorithms to improve sample-efficiency. Concretely, the differentiable CoIT leverages original samples together with augmented samples and accelerates the state encoder toward a contrastive invariant embedding. We evaluate our approach on the DeepMind Control Suite and Atari100K. Empirical results verify the advantages of CoIT, enabling it to outperform the state-of-the-art on various tasks.

1. INTRODUCTION

Improving data-efficiency for sequential decision-making has always been a crucial problem in pixel-based reinforcement learning, since the agent has to learn an optimal policy and a meaningful information abstraction from observations in parallel. Unlike supervised representation learning, which benefits from strong high-dimensional supervision signals, the training process in RL is fragile: inappropriate representation learning can harm training and consequently degrade performance. Hence, there is a pressing need for careful representation learning methods in visual RL. Previous works have demonstrated that introducing auxiliary loss functions such as pixel reconstruction (Yarats et al., 2019) and contrastive learning (Laskin et al., 2020b) alleviates this issue. In particular, data augmentations have already proven beneficial to data-efficiency. RAD (Laskin et al., 2020a) performs an extensive set of experiments and analyzes the impact of various data augmentation techniques. DrQ (Yarats et al., 2020) and DrQ-v2 (Yarats et al., 2021) make use of appropriate image augmentation with great success. Previous works have also explored the potential of data augmentation for generalization (Hansen et al., 2021; Raileanu et al., 2020; Zhang & Guo, 2021; Hansen & Wang, 2021; Fan et al., 2021). Despite these efforts, it is hard to guarantee that augmented representations are sufficiently diverse yet semantically consistent. To this end, we explore the underlying conditions for representation learning in RL. It is rational to hypothesize that there exists an optimal transformation enabling an encoder to abstract an informative latent space.
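To make the contrastive auxiliary objectives mentioned above concrete, the following is a minimal sketch of an InfoNCE-style loss over embeddings of two augmented views of the same observations, in the spirit of CURL. All names are illustrative; this is not the implementation from any of the cited works.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Toy InfoNCE sketch for a contrastive auxiliary objective.

    anchors, positives: (N, D) embeddings of two augmented views of the
    same N observations; matching pairs share the same row index."""
    # L2-normalize so the logits are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True) # numerical stability
    # Treat each row as an N-way classification; the true class is the
    # diagonal entry (the positive pair).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

As expected, the loss is near zero when the two views embed identically and grows when positives are mismatched, which is what drives the encoder toward augmentation-invariant embeddings.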
This line of work belongs to the regime of state abstraction (Du et al., 2019; Zhang et al., 2020b; Tomar et al., 2021; Wang et al., 2022), which derives from grouping similar world-states into compact descriptions of the environment (Dietterich, 2000; Andre & Russell, 2002; Castro & Precup, 2010). Inspired by spatial transformer networks (STN) (Jaderberg et al., 2015), a parameterized data augmentation model from the vision domain, we consider that merging a parameterized transformation with visual RL could be beneficial. The designed transformation not only discovers an optimal state abstraction but also produces diverse virtual samples for the agent. To do so, we enforce a learnable data augmentation whose parameters are updated along with the RL objective. To understand parameterized augmentation and its relation to representation learning in RL, we focus on fundamental data manipulation by generating augmented data from a learnable Gaussian distribution. Specifically, the image transformation controls the margin of the augmentation under a data distribution that remains friendly to RL training, since data distributions that change while being controlled by the learning algorithm can be helpful in high-dimensional cases (Balestriero et al., 2021). In light of this challenge, we present contrastive invariant transformation (CoIT), a novel contrastive learning method that ameliorates data-efficiency for visual RL. CoIT integrates a learnable transformation into model-free methods with minimal modification to the architecture and training pipeline. Specifically, we parameterize the mean and variance of a Gaussian distribution for transforming data and update these parameters together with the RL objective, using constraints that empirically encourage faster convergence. As learning proceeds, the agent approximates the transformation distribution that is optimal for the task at hand.
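The idea of an augmentation with learnable Gaussian parameters can be sketched as follows. This is a simplified illustration, not the paper's implementation: the class name, the choice of integer translation as the transformation, and the clipping constraint are all assumptions for the example. In an autodiff framework the reparameterized draw `mu + sigma * eps` is what lets gradients from the RL objective update `mu` and `log_sigma`; here we only show the forward sampling path.

```python
import numpy as np

class LearnableShiftAugment:
    """Hypothetical sketch of a learnable Gaussian-parameterized augmentation.

    The shift magnitude is drawn from N(mu, sigma^2); mu and log_sigma would
    be updated jointly with the RL objective in a differentiable framework."""

    def __init__(self, mu=0.0, log_sigma=0.0, max_shift=4):
        self.mu = mu
        self.log_sigma = log_sigma
        self.max_shift = max_shift  # constraint keeping the transform near identity

    def sample_shift(self, rng):
        # Reparameterization trick: shift = mu + sigma * eps makes the sample
        # differentiable w.r.t. (mu, log_sigma) under autodiff.
        eps = rng.standard_normal()
        shift = self.mu + np.exp(self.log_sigma) * eps
        # Clip so the augmented observation stays semantically consistent.
        return int(np.clip(round(shift), -self.max_shift, self.max_shift))

    def __call__(self, obs, rng):
        # obs: (H, W, C) image; apply a sampled integer translation per axis.
        dy, dx = self.sample_shift(rng), self.sample_shift(rng)
        return np.roll(np.roll(obs, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(0)
aug = LearnableShiftAugment(mu=0.0, log_sigma=0.5)
obs = rng.random((84, 84, 3))   # an 84x84 RGB observation, as in DMC pipelines
aug_obs = aug(obs, rng)
```

Because the translation is a pixel permutation, the augmented observation retains exactly the original content, only spatially shifted, which is the kind of optimality-invariant behavior the learned distribution is meant to preserve.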
In addition, we evaluate CoIT on the DeepMind Control Suite and Atari100K, and experimental results demonstrate that the learnable transformation outperforms current SOTA methods. Our method requires no custom architectural choices and supports straightforward end-to-end training. Based on these results, we demonstrate that a learnable transformation effectively improves data-efficiency for visual RL. Key Contributions: (i) We present CoIT, a simple yet effective framework with a learnable image transformation that integrates invariant representations with model-free RL to improve data-efficiency. (ii) We provide a theoretical analysis of how our method approximates a stationary distribution over the transformed data under the optimal invariance metric, thus learning better representations. (iii) We evaluate CoIT on popular benchmarks and show that our method outperforms previous state-of-the-art methods in data-efficiency and stability.

2. RELATED WORK

Several concurrent methods have been proposed for improving data-efficiency; their common ingredients, data augmentation and self-supervised learning, are discussed below.

Data augmentation in RL. Like the success of data augmentation in computer vision (Zhong et al., 2020; DeVries & Taylor, 2017; Yun et al., 2019; Zhang et al., 2017), these methods have played a key role in improving the data-efficiency of visual RL (Mnih et al., 2013; Yarats et al., 2019; Hafner et al., 2019; Lee et al., 2019). RAD (Laskin et al., 2020a) conducted extensive experiments and found that different data augmentations lead to entirely different results, providing a broader perspective for follow-up studies of data augmentation in RL. DrQ (Yarats et al., 2020) proposed an effective augmentation method called random shift and introduced a regularization term for Q-learning. Building on DrQ, DrQ-v2 (Yarats et al., 2021) made minimal changes and demonstrated that a simple augmentation alone can match state-of-the-art model-based algorithms in data-efficiency and performance.

Self-supervised learning in RL. Motivated by breakthroughs in self-supervised learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020; Grill et al., 2020), it is natural to combine these

* Work done during an internship at the Institute of Automation, Chinese Academy of Sciences. † Corresponding Author.

Figure 1: Percentage (%) of score solved on the DMC. We set the score of DrQ-v2 as 100% and report results for CoIT and CURL. Top: at 500K steps. Bottom: at 100K steps. A task is considered solved when the return nearly reaches its upper bound.

Code availability: https://github.com/mooricAnna/

