IMAGE AUGMENTATION IS ALL YOU NEED: REGULARIZING DEEP REINFORCEMENT LEARNING FROM PIXELS

Abstract

Existing model-free reinforcement learning (RL) approaches are effective when trained on states but struggle to learn directly from image observations. We propose an augmentation technique that can be applied to standard model-free RL algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, as well as to regularize the value function and policy. Our approach reaches a new state-of-the-art performance on the DeepMind control suite and the Atari 100k benchmark, surpassing previous model-free (Haarnoja et al., 2018), model-based (Hafner et al., 2019), and contrastive learning (Srinivas et al., 2020) approaches.

1. INTRODUCTION

Sample-efficient deep reinforcement learning (RL) algorithms capable of training directly from image pixels would open up many real-world applications in control and robotics. However, simultaneously training a convolutional encoder alongside a policy network is challenging given limited environment interaction, strong correlation between samples, and a typically sparse reward signal. Limited supervision is a common problem across AI, and two approaches are commonly taken: (i) training with additional auxiliary losses, such as those based on self-supervised learning (SSL), and (ii) training with data augmentation.

A wide range of auxiliary loss functions have been proposed to augment supervised objectives, e.g. weight regularization, noise injection (Hinton et al., 2012), or various forms of auto-encoder (Kingma et al., 2014). In RL, reconstruction losses (Jaderberg et al., 2017; Yarats et al., 2019) or SSL objectives (Dwibedi et al., 2018; Srinivas et al., 2020) are used. However, these objectives are unrelated to the task at hand and thus have no guarantee of inducing a representation appropriate for the policy network. SSL losses are highly effective in the large-data regime, e.g. in domains such as vision (Chen et al., 2020; He et al., 2019) and NLP (Collobert et al., 2011; Devlin et al., 2018), where large (unlabeled) datasets are readily available. In sample-efficient RL, however, training data is more limited due to restricted interaction between the agent and the environment, limiting their effectiveness.

Data augmentation methods are widely used in vision and speech domains, where output-invariant perturbations can easily be applied to the labeled input examples. Surprisingly, data augmentation has received little attention in the RL community.

* Equal contribution. Author ordering determined by coin flip. Both authors are corresponding.
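To make the notion of an output-invariant perturbation concrete, the sketch below shows one simple transformation of this kind for image observations: a random shift implemented as edge-replication padding followed by a random crop back to the original size. The padding width and padding mode here are illustrative choices, not a prescription from the text.

```python
import numpy as np

def random_shift(obs: np.ndarray, pad: int = 4) -> np.ndarray:
    """Randomly shift an H x W x C image by up to `pad` pixels.

    Pads the image by replicating its edge pixels, then takes a random
    H x W crop of the padded image. The content is translated slightly,
    but the underlying state (and hence the optimal action) is unchanged.
    """
    h, w, _ = obs.shape
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = np.random.randint(0, 2 * pad + 1)
    left = np.random.randint(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```

Because the transformation preserves the observation's shape and semantics, it can be applied to samples drawn from a replay buffer without any change to the RL algorithm consuming them.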
In this paper we propose augmentation approaches appropriate for sample-efficient RL and comprehensively evaluate them. The key idea is to use standard image transformations to perturb input observations, and to regularize the Q-function learned by the critic so that different transformations of the same input image have similar Q-values. No further modifications to standard actor-critic algorithms are required. Our study is, to the best of our knowledge, the first careful examination of image augmentation in sample-efficient RL. The main contributions of the paper are as follows: (i) we are the first to demonstrate that data augmentation greatly improves performance when training model-free RL algorithms from images; (ii) we introduce a natural way to exploit MDP structure through two mechanisms for regularizing the value function, in a manner that is generally applicable to model-free RL; and (iii) we set a new state-of-the-art performance on the standard DeepMind control suite (Tassa et al., 2018), closing the gap with learning from states, and on the Atari 100k benchmark (Kaiser et al., 2019).
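The two value-function regularization mechanisms can be sketched schematically as follows: averaging the TD target over several augmentations of the next observation, and averaging the critic's TD error over several augmentations of the current observation. This is a simplified illustration, not the paper's exact implementation; `q_fn`, `q_target_fn`, and `augment` are placeholder callables.

```python
import numpy as np

def augmented_td_target(q_target_fn, next_obs, reward, discount,
                        augment, K=2):
    """Mechanism 1: average the TD target over K random augmentations
    of the next observation, reducing the variance of the target."""
    targets = [q_target_fn(augment(next_obs)) for _ in range(K)]
    return reward + discount * np.mean(targets, axis=0)

def augmented_td_loss(q_fn, obs, target, augment, M=2):
    """Mechanism 2: average the squared TD error over M random
    augmentations of the current observation, pushing the Q-function
    toward the same value on every transformed view of the input."""
    errors = [(q_fn(augment(obs)) - target) ** 2 for _ in range(M)]
    return np.mean(errors)
```

With the identity augmentation and K = M = 1 this reduces to the ordinary TD loss, which is why no further modification of the underlying actor-critic algorithm is needed.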

2. RELATED WORK

Data Augmentation in Computer Vision. Data augmentation via image transformations has been used to improve generalization since the inception of convolutional networks (Becker & Hinton, 1992; Simard et al., 2003; LeCun et al., 1989; Ciresan et al., 2011; Ciregan et al., 2012). Following AlexNet (Krizhevsky et al., 2012), such transformations have become a standard part of training pipelines. For object classification tasks, the transformations are selected to avoid changing the semantic category, i.e. translations, scales, color shifts, etc. While a similar set of transformations is potentially applicable to control tasks, the RL context does require modifications to the underlying algorithm. Data augmentation methods have also been used in the context of self-supervised learning. Dosovitskiy et al. (2016) use per-exemplar perturbations in an unsupervised classification framework. More recently, several approaches (Chen et al., 2020; He et al., 2019; Misra & van der Maaten, 2019) have used invariance to imposed image transformations in contrastive learning schemes, producing state-of-the-art results on downstream recognition tasks. By contrast, our scheme addresses control tasks, utilizing different types of invariance.

Data Augmentation in RL. In contrast to computer vision, data augmentation is rarely used in RL. Certain approaches adopt it implicitly: for example, Levine et al. (2018) and Kalashnikov et al. (2018) use image augmentation as part of the AlexNet training pipeline, but without analyzing the benefits arising from it, so the technique was overlooked in subsequent work. HER (Andrychowicz et al., 2017) exploits information about the observation space via goal and reward relabeling, which can be viewed as a way to perform data augmentation. Other work uses data augmentation to improve generalization in domain transfer (Cobbe et al., 2018). However, the classical image transformations used in vision have not previously been shown to definitively help on standard RL benchmarks.
Concurrent with our work, RAD (Laskin et al., 2020) explores different data augmentation approaches, but is limited to transformations of the image alone, without the additional regularization of the Q-function used in our approach. Indeed, RAD can be regarded as a special case of our algorithm. Multiple follow-ups to our initial preprint have appeared on arXiv (Raileanu et al., 2020; Okada & Taniguchi, 2020), applying similar techniques to other tasks and thus supporting the effectiveness and generality of data augmentation in RL.

Continuous Control from Pixels. There are a variety of methods addressing the sample-efficiency of RL algorithms that learn directly from pixels. The most prominent approaches can be classified into two groups: model-based and model-free. Model-based methods attempt to learn the system dynamics in order to acquire a compact latent representation of high-dimensional observations, on which policy search is then performed (Hafner et al., 2018; Lee et al., 2019; Hafner et al., 2019). In contrast, model-free methods either learn the latent representation indirectly by optimizing the RL objective (Barth-Maron et al., 2018; Abdolmaleki et al., 2018) or employ auxiliary losses that provide additional supervision (Yarats et al., 2019; Srinivas et al., 2020; Sermanet et al., 2018; Dwibedi et al., 2018). Our approach is complementary to these methods and can be combined with them to improve performance.

