AUGMENTATION CURRICULUM LEARNING FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Many Reinforcement Learning tasks rely solely on pixel-based observations of the environment. During deployment, these observations can fall victim to visual perturbations and distortions, causing the agent's policy to degrade significantly in performance. This motivates the need for robust agents that can generalize in the face of visual distribution shift. One common technique for achieving this is to apply augmentations during training; however, doing so comes at a cost in training performance. We propose Augmentation Curriculum Learning (AugCL), a novel curriculum learning approach that schedules augmented training into a weak augmentation phase followed by a strong augmentation phase. We also introduce a novel visual augmentation strategy that proves beneficial on the benchmarks we evaluate on. Our method achieves state-of-the-art performance on the Deep Mind Control Generalization Benchmark.

1. INTRODUCTION

Reinforcement Learning (RL) has shown great success in a large variety of problems, from video games Mnih et al. (2013) and navigation Wijmans et al. (2019) to manipulation Levine et al. (2016); Kalashnikov et al. (2018), even while operating from high-dimensional pixel inputs. Despite this success, the policies produced by RL are well suited only for the environment they were trained in and fail to generalize to new environments. Instead, agents overfit to task-irrelevant visual features, so that even simple visual distortions degrade policy performance. A key objective of image-based RL is building robust agents that can generalize beyond the training environment.

Several existing approaches to training more robust agents include domain randomization Pinto et al. (2017); Tobin et al. (2017) and data augmentation Hansen & Wang (2021); Hansen et al. (2021b); Fan et al. (2021). Domain randomization modifies the training environment simulator to create more varied training data, whereas data augmentation augments the image observations representing states without modifying the simulator itself. Prior work shows that pixel-based augmentation improves sample efficiency and helps agents achieve performance matching state-based RL Laskin et al.; Yarats et al. (2021). Therefore, in this work, we focus on data-augmented generalization to visual distribution shift in which the task semantics remain unchanged. We specifically aim to do this in a zero-shot manner, i.e., where shifted data is unavailable during training.

Unlike in supervised and self-supervised image classification tasks, augmentation for pixel-based RL has demonstrated mixed levels of success. Prior work categorizes augmentations into weak and strong ones based on downstream training performance Fan et al. (2021). Specifically, weak augmentations are defined as those that allow the agent to learn a policy with higher episodic rewards in the training environment than training without augmentation, while strong augmentations are those that lead to empirically worse performance than training with no augmentation. Classifying augmentations according to this definition is task dependent. For example, cutout color has empirically been shown to be detrimental (a "strong augmentation") for all tasks in the Deep Mind Control Suite (DMC) Tassa et al. (2018), but effective (a "weak augmentation") for Star Pilot in Procgen Cobbe et al. (2019), as shown in Laskin et al. Methods exist that attempt to automate finding the optimal weak augmentation on a per-task basis Raileanu et al. (2020), but these still do not expand the set of augmentations that can be used effectively. Many RL generalization methods leverage weak augmentation for better policy learning and add strong augmentations during training for generalization to visual distribution shift Hansen & Wang (2021); Hansen et al. (2021b); Fan et al. (2021). However, these methods suffer from strong augmentations making training harder: the difficulty of learning from such diverse visual observations destabilizes training. As a result, agents trained with strong augmentations do not learn policies as strong as those trained with weak augmentations alone.

In this work, we introduce a new training method that avoids the training instabilities caused by strong augmentations through a curriculum that separates augmented training into weak and strong phases. Once the network has been sufficiently regularized in the weak augmentation phase, it is cloned to create a policy network that is trained on strong augmentations. This disentangles the responsibilities of the networks: accurately approximating the Q-value (the network trained on weak augmentations) and generalizing, i.e., performing well on shifted test distributions (the network trained on strong augmentations). Crucially, we keep the two networks separate to avoid the destabilizing effect of strong augmentations. We also demonstrate the power of the method under an even more severe augmentation, namely a new splicing augmentation that pastes task-relevant visual features into an irrelevant background. We show that our curriculum learning approach can effectively leverage strong augmentations, and that the combination of our method with this new augmentation technique achieves state-of-the-art generalization performance. Our main contributions are summarized as follows:

• We introduce Augmentation Curriculum Learning (AugCL), a new method for learning with strong visual augmentations for generalization to unseen environments in pixel-based RL.

• A new visual augmentation named Splice, which helps prevent overfitting to task-irrelevant features by simulating distracting backgrounds.

• We demonstrate that AugCL achieves state-of-the-art results across a suite of pixel-based RL generalization benchmarks.

2. RELATED WORK

2.1. CURRICULUM LEARNING

Inspired by how humans learn, Elman (1993) proposed training networks in a curriculum style, starting with easier training examples and gradually increasing complexity as training ensues. Bengio et al. (2009) showed that this training style yields better generalization faster, and that introducing more difficult examples gradually can speed up online training. This ideology has been shown to transfer to RL across varying types of generalization (Cobbe et al., 2019; Wang et al., 2019; Florensa et al., 2018; Sukhbaatar et al., 2017), but to our knowledge it has never been explored for generalization to visual perturbations in pixel-based RL.

2.2. RL GENERALIZATION BENCHMARKS

There are many benchmarks designed for evaluating agents under different distribution shifts (Chattopadhyay et al., 2021; Dosovitskiy et al., 2017; Stone et al., 2021; Zhu et al., 2020; Li et al., 2021; Szot et al., 2021). We chose the Deep Mind Control Generalization Benchmark (DMC-GB) (Hansen & Wang, 2021) because the current SOTA methods have been benchmarked on it, allowing us to compare directly to the results reported in previous works. DMC-GB offers 4 generalization modes: color easy, color hard, video easy, and video hard. These modes can be applied to all DMC tasks, and visual examples can be seen in Figure 4. Color easy is not benchmarked on as it is considered solved (Hansen & Wang, 2021). Color hard dynamically changes the colors of the agent, background, and flooring. Video easy changes the background to a random image, and video hard changes both the background and the floor to a random image. Under these extreme perturbations, the agent must learn to identify the visual features relevant to the task in order to maximize reward.

2.3. GENERALIZATION IN VISUAL RL

There have been many advances in visual RL generalization. In this section, we briefly summarize each method. SODA (Hansen & Wang, 2021) leverages an approximate contrastive loss that minimizes the distance in embedding space between a crop-augmented view of a state and a strongly augmented copy of the same state. SECANT (Fan et al., 2021) trains an agent using crop and then leverages that agent's policy to train a new agent under strong augmentation in an imitation learning fashion. The prior SOTA approach to DMC-GB was
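The weak-then-strong curriculum described in the introduction can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's actual implementation: the names (`AugCLSchedule`, `weak_augment`, `strong_augment`), the use of a random pad-and-crop shift as the weak augmentation, and a noise blend standing in for a strong augmentation.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

def weak_augment(obs):
    # Weak augmentation: pad the image by 4 pixels and randomly crop back,
    # a commonly used random-shift augmentation in pixel-based RL.
    pad = 4
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    x, y = rng.integers(0, 2 * pad, size=2)
    return padded[x:x + obs.shape[0], y:y + obs.shape[1]]

def strong_augment(obs):
    # Stand-in for a strong augmentation: blend the observation with random noise.
    noise = rng.uniform(0, 255, size=obs.shape)
    return 0.5 * obs + 0.5 * noise

class AugCLSchedule:
    """Minimal two-phase curriculum: train with weak augmentation, then clone
    the network once and continue training the clone with strong augmentation."""

    def __init__(self, network, switch_step):
        self.weak_network = network      # trained with weak augmentation throughout
        self.strong_network = None       # created by cloning at the switch point
        self.switch_step = switch_step

    def step(self, global_step, obs):
        if global_step < self.switch_step:
            return self.weak_network, weak_augment(obs)
        if self.strong_network is None:
            # Clone once the weak phase has sufficiently regularized the network.
            self.strong_network = copy.deepcopy(self.weak_network)
        return self.strong_network, strong_augment(obs)
```

The key design point mirrored from the text is that the strongly augmented network is a separate copy, so instability from strong augmentations cannot feed back into the weakly trained Q-value estimate.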

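Splice, as described in the introduction, pastes task-relevant pixels into an unrelated background. A minimal sketch follows, assuming the task-relevant region is available as a boolean mask; how that mask is obtained is an implementation detail, and the background-subtraction proxy below is our own assumption rather than necessarily what the paper uses.

```python
import numpy as np

def mask_from_background(obs, clean_background, threshold=20):
    # Simple proxy: pixels that differ from a clean background plate are "foreground".
    diff = np.abs(obs.astype(np.int16) - clean_background.astype(np.int16)).sum(axis=-1)
    return diff > threshold

def splice(obs, foreground_mask, distractor):
    """Paste task-relevant pixels (where the mask is True) onto an irrelevant background.

    obs:             H x W x C observation from the training environment
    foreground_mask: H x W boolean array marking task-relevant pixels
    distractor:      H x W x C unrelated image used as the new background
    """
    out = distractor.copy()
    out[foreground_mask] = obs[foreground_mask]
    return out
```

At training time the distractor would be drawn from a pool of unrelated natural images, so the agent is forced to rely on the task-relevant foreground rather than the background.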

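The SODA-style objective summarized in Section 2.3 pulls the embeddings of two views of the same state together. A rough sketch of such a consistency term is below; the function name is our own, and details of the real method (projection heads, momentum targets, the exact loss form) are elided.

```python
import numpy as np

def consistency_loss(z_anchor, z_augmented):
    # Mean squared distance between L2-normalized embeddings of two views
    # of the same state (e.g., a cropped view and a strongly augmented view).
    a = z_anchor / np.linalg.norm(z_anchor, axis=-1, keepdims=True)
    b = z_augmented / np.linalg.norm(z_augmented, axis=-1, keepdims=True)
    return float(np.mean(np.sum((a - b) ** 2, axis=-1)))
```

Minimizing this term encourages the encoder to map differently augmented copies of a state to nearby points in embedding space, which is the mechanism the summarized methods rely on for robustness.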