IMAGE AUGMENTATION IS ALL YOU NEED: REGULARIZING DEEP REINFORCEMENT LEARNING FROM PIXELS

Abstract

Existing model-free reinforcement learning (RL) approaches are effective when trained on states but struggle to learn directly from image observations. We propose an augmentation technique that can be applied to standard model-free RL algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, and additionally regularizes the value function and policy. Our approach reaches a new state-of-the-art performance on the DeepMind control suite and the Atari 100k benchmark, surpassing previous model-free approaches (Haarnoja et al., 2018).

1. INTRODUCTION

Sample-efficient deep reinforcement learning (RL) algorithms capable of training directly from image pixels would open up many real-world applications in control and robotics. However, simultaneously training a convolutional encoder alongside a policy network is challenging given limited environment interaction, strong correlation between samples, and a typically sparse reward signal.

Limited supervision is a common problem across AI, and two approaches are commonly taken: (i) training with additional auxiliary losses, such as those based on self-supervised learning (SSL), and (ii) training with data augmentation. A wide range of auxiliary loss functions have been proposed to augment supervised objectives, e.g. weight regularization, noise injection (Hinton et al., 2012), or various forms of auto-encoder (Kingma et al., 2014). In RL, reconstruction losses (Jaderberg et al., 2017; Yarats et al., 2019) or SSL objectives (Dwibedi et al., 2018; Srinivas et al., 2020) are used. However, these objectives are unrelated to the task at hand and thus have no guarantee of inducing a representation that is appropriate for the policy network.

SSL losses are highly effective in the large-data regime, e.g. in domains such as vision (Chen et al., 2020; He et al., 2019) and NLP (Collobert et al., 2011; Devlin et al., 2018), where large (unlabeled) datasets are readily available. However, in sample-efficient RL, training data is more limited due to restricted interaction between the agent and the environment, limiting their effectiveness.

Data augmentation methods are widely used in vision and speech domains, where output-invariant perturbations can easily be applied to the labeled input examples. Surprisingly, data augmentation

* Equal contribution. Author ordering determined by coin flip. Both authors are corresponding.
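To make the notion of an output-invariant perturbation on image observations concrete, the following is a minimal sketch, not the authors' implementation, of a random-shift augmentation of the kind commonly applied to pixel inputs: the observation is padded by replicating edge pixels and then cropped back to its original size at a random offset. The function name `random_shift`, the pad size of 4, and the 84x84 observation shape are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def random_shift(obs: np.ndarray, pad: int = 4, rng=None) -> np.ndarray:
    """Illustrative random-shift augmentation (assumed, not the paper's code).

    Pads an HxWxC image by `pad` pixels on each side using edge replication,
    then crops a random window back to the original HxW size, producing a
    small translation that leaves the image content (and hence the optimal
    action) essentially unchanged.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = obs.shape
    # Replicate border pixels so the shifted image has no black margins.
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Random crop offsets in [0, 2 * pad], inclusive.
    top = int(rng.integers(0, 2 * pad + 1))
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w, :]

# Example: augmenting a dummy 84x84 RGB observation preserves shape and dtype.
obs = np.arange(84 * 84 * 3, dtype=np.uint8).reshape(84, 84, 3)
aug = random_shift(obs)
```

Because the perturbation preserves the task-relevant content of the observation, it can be applied to samples drawn from the replay buffer without altering their rewards or actions.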

