AUGMENTATION CURRICULUM LEARNING FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Many Reinforcement Learning tasks rely solely on pixel-based observations of the environment. During deployment, these observations can be subject to visual perturbations and distortions, causing the agent's policy to degrade significantly in performance. This motivates the need for robust agents that can generalize in the face of visual distribution shift. One common technique for achieving this is to apply augmentations during training; however, doing so comes at a cost in training performance. We propose Augmentation Curriculum Learning, a novel curriculum learning approach that schedules augmentation during training into a weak augmentation phase and a strong augmentation phase. We also introduce a novel visual augmentation strategy that improves performance on the benchmarks we evaluate on. Our method achieves state-of-the-art performance on the DeepMind Control Generalization Benchmark.

1. INTRODUCTION

Reinforcement Learning (RL) has shown great success in a wide variety of problems, from video games Mnih et al. (2013) to navigation Wijmans et al. (2019) and manipulation Levine et al. (2016); Kalashnikov et al. (2018), even while operating from high-dimensional pixel inputs. Despite this success, the policies produced by RL are only well suited to the environment they were trained in and fail to generalize to new environments. Instead, agents overfit to task-irrelevant visual features, so that even simple visual distortions degrade policy performance. A key objective of image-based RL is building robust agents that generalize beyond the training environment.



Several existing approaches to training more robust agents include domain randomization Pinto et al. (2017); Tobin et al. (2017) and data augmentation Hansen & Wang (2021); Hansen et al. (2021b); Fan et al. (2021). Domain randomization modifies the training environment simulator to create more varied training data, whereas data augmentation perturbs the image observations representing states without modifying the simulator itself. Prior work shows that pixel-based augmentation improves sample efficiency and helps agents match the performance of state-based RL Laskin et al.; Yarats et al. (2021). In this work, we therefore focus on data-augmented generalization to visual distribution shift in which the task semantics remain unchanged. We specifically aim to do this in a zero-shot manner, i.e., without access to shifted data during training.

Unlike in supervised and self-supervised image classification, augmentation for pixel-based RL has demonstrated mixed levels of success. Prior work categorizes augmentations as weak or strong based on downstream training performance Fan et al. (2021). Specifically, weak augmentations are defined as those that allow the agent to learn a policy with higher episodic rewards in the training environment than training without augmentation, while strong augmentations are those that lead to empirically worse performance than training without augmentation. Classifying augmentations according to this definition is task-dependent. For example, cutout color has empirically been shown to be detrimental (a strong augmentation) for all tasks in the DeepMind Control Suite (DMC) Tassa et al. (2018), but effective (a weak augmentation) for Star Pilot in Procgen Cobbe et al. (2019), as shown in Laskin et al.. Methods exist that attempt to automate finding the optimal weak augmentation on a per-task basis Raileanu et al. (2020), but these still do not broaden the set of augmentations that can be used effectively.
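To make the weak/strong distinction concrete, the sketch below illustrates two representative pixel augmentations: a random shift (typically weak on DMC tasks) and cutout color (typically strong on DMC). This is our own NumPy illustration, not an implementation from any of the cited works; the function names and parameter choices are ours.

```python
import numpy as np

def random_shift(obs, pad=4, rng=None):
    # Weak augmentation: pad the image at its borders, then crop back
    # to the original size at a random offset. Image content is only
    # translated by a few pixels, so task-relevant structure survives.
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = obs.shape
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

def cutout_color(obs, max_frac=0.4, rng=None):
    # Strong augmentation: overwrite a random rectangle with a random
    # solid color. This can occlude task-relevant objects entirely,
    # which is one intuition for why it often hurts training on DMC.
    if rng is None:
        rng = np.random.default_rng()
    out = obs.copy()
    h, w, _ = obs.shape
    ch = rng.integers(1, max(2, int(h * max_frac)))
    cw = rng.integers(1, max(2, int(w * max_frac)))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out[top:top + ch, left:left + cw] = rng.integers(0, 256, size=3,
                                                     dtype=out.dtype)
    return out
```

In practice such augmentations are applied per frame (or per stacked-frame observation) when sampling batches from the replay buffer; both functions above preserve the observation's shape and dtype, so they can be dropped into a training loop without changing the network input.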
Many RL generalization methods leverage weak augmentations for better policy learning and add strong augmentations during training to generalize under visual distribution shift Hansen & Wang (2021); Hansen et al. (2021b); Fan et al. (2021). However, these methods suffer from strong

