AUGMENTATION CURRICULUM LEARNING FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Many Reinforcement Learning tasks rely solely on pixel-based observations of the environment. During deployment, these observations can fall victim to visual perturbations and distortions, causing the agent's policy to degrade significantly in performance. This motivates the need for robust agents that can generalize in the face of visual distribution shift. One common technique is to apply augmentations during training; however, this comes at the cost of training performance. We propose Augmentation Curriculum Learning (AugCL), a novel curriculum learning approach that schedules augmentation during training into a weak augmentation phase followed by a strong augmentation phase. We also introduce a novel visual augmentation strategy that improves results on the benchmarks we evaluate on. Our method achieves state-of-the-art performance on the Deep Mind Control Generalization Benchmark.

1. INTRODUCTION

Reinforcement Learning (RL) has shown great success in a large variety of problems, from video games (Mnih et al., 2013) to navigation (Wijmans et al., 2019) and manipulation (Levine et al., 2016; Kalashnikov et al., 2018), even while operating from high-dimensional pixel inputs. Despite this success, the policies produced by RL are well suited only to the environment they were trained in and fail to generalize to new environments. Agents overfit to task-irrelevant visual features, so even simple visual distortions degrade policy performance. A key objective of image-based RL is building robust agents that can generalize beyond the training environment. Existing approaches to training more robust agents include domain randomization (Pinto et al., 2017; Tobin et al., 2017) and data augmentation (Hansen & Wang, 2021; Hansen et al., 2021b; Fan et al., 2021). Domain randomization modifies the training environment simulator to create more varied training data, whereas data augmentation augments the image observations representing states without modifying the simulator itself. Prior work shows pixel-based augmentation improves sample efficiency and helps agents achieve performance matching state-based RL (Laskin et al.; Yarats et al., 2021). Therefore, in this work, we focus on data-augmented generalization to visual distribution shift while the semantics remain unchanged. We specifically aim to do this in a zero-shot manner, i.e., where shifted data is unavailable during training. Unlike for supervised and self-supervised image classification tasks, augmentation for pixel-based RL has demonstrated mixed levels of success. Prior work categorized augmentations into weak and strong ones based on downstream training performance (Fan et al., 2021).
Specifically, prior works define weak augmentations as those allowing the agent to learn a policy with higher episodic rewards in the training environment than training without augmentation. Strong augmentations are those that lead to empirically worse performance than training with no augmentations. Classifying augmentations according to this definition is dependent on the task. For example, cutout color has empirically been shown to be detrimental ("strong augmentation") for all tasks in the Deep Mind Control Suite (DMC) (Tassa et al., 2018), but is effective ("weak augmentation") for Star Pilot in Procgen (Cobbe et al., 2019), as shown in Laskin et al. Methods exist that attempt to automate finding the optimal weak augmentation on a per-task basis (Raileanu et al., 2020), but these still do not expand the effectiveness of many augmentations. Many RL generalization methods leverage weak augmentation for better policy learning and add strong augmentations in training for generalization to visual distribution shift (Hansen & Wang, 2021; Hansen et al., 2021b; Fan et al., 2021). However, in these methods strong augmentation makes training harder and less stable, because learning from such diverse visual observations is difficult. As a result, the agent does not learn a policy that performs as well as one trained with weak augmentations alone. In this work, we introduce a new training method that avoids the training instabilities caused by strong augmentations through a curriculum that separates augmented training into weak and strong training phases. Once the network has been sufficiently regularized in the weak augmentation phase, it is cloned to create a policy network that is trained on strong augmentations.
This disentangles the responsibilities of the networks into accurately approximating the Q-value of the agent (network trained on weak augmentations) and generalization (doing well on shifted test distributions). Crucially, we separate the two networks to avoid the destabilizing effect of strong augmentations. We also demonstrate the power of the method under even more severe augmentation, namely a new splicing augmentation that pastes relevant visual features into an irrelevant background. We show that our curriculum learning approach can effectively leverage strong augmentations, and the combination of our method with this new augmentation technique achieves state-of-the-art generalization performance. Our main contributions are summarized as follows:
• We introduce Augmentation Curriculum Learning (AugCL), a new method for learning with strong visual augmentations for generalization to unseen environments in pixel-based RL.
• We introduce a new visual augmentation named Splice, which, by simulating distracting backgrounds, helps prevent overfitting to task-irrelevant features.
• We demonstrate that AugCL achieves state-of-the-art results across a suite of pixel-based RL generalization benchmarks.

2.1. CURRICULUM LEARNING

Inspired by how humans learn, Elman (1993) proposed training networks in a curriculum style by starting with easier training examples and then gradually increasing complexity as training proceeds. Bengio et al. (2009) showed that this training style yielded better generalization results faster, and that by introducing more difficult examples gradually, online training could be sped up. This idea has been shown to transfer to RL across varying types of generalization (Cobbe et al., 2019; Wang et al., 2019; Florensa et al., 2018; Sukhbaatar et al., 2017), but to our knowledge has never been explored for generalization to visual perturbations in pixel-based RL.

2.2. RL GENERALIZATION BENCHMARKS

There are many benchmarks designed for evaluating agents under different distribution shifts (Chattopadhyay et al., 2021; Dosovitskiy et al., 2017; Stone et al., 2021; Zhu et al., 2020; Li et al., 2021; Szot et al., 2021). We chose the Deep Mind Control Generalization Benchmark (DMC-GB) (Hansen & Wang, 2021) as the current SOTA methods have been benchmarked on it, allowing us to compare directly to the results shown in previous works. DMC-GB offers 4 different generalization modes: color easy, color hard, video easy, and video hard. These modes can be applied to all DMC tasks, and visual examples can be seen in Figure 4. Color easy is not benchmarked on as it is considered solved (Hansen & Wang, 2021). Color hard dynamically changes the color of the agent, background, and flooring. Video easy changes the background to another random image, and video hard changes both the background and floor to a random image. Under these extreme perturbations, the agent must learn to identify the visual features relevant to the task in order to maximize reward.

2.3. GENERALIZATION IN VISUAL RL

There have been many advances in visual RL generalization. In this section, we briefly summarize each method. SODA (Hansen & Wang, 2021) leverages an approximate contrastive loss that minimizes the distance in embedding space between the same state augmented with crop and a strongly augmented copy of it. SECANT (Fan et al., 2021) trains an agent using crop and then leverages that agent's policy for training a new agent under strong augmentation in an imitation learning fashion. The prior SOTA approach to DMC-GB was SVEA (Hansen et al., 2021b). SVEA leverages data-mixing, which modifies the critic loss, Equation (1), to a weighted combination of the weakly augmented and strongly augmented temporal difference errors (Sutton & Barto, 2018). We take inspiration from SVEA in our method by bootstrapping a critic trained under strong augmentation from a Q target network that receives its updates under weak augmentation.

3.1. PROBLEM: GENERALIZATION IN PIXEL CONTROL TASKS

We formulate our problem as interacting with a Markov Decision Process (MDP) defined as the tuple $M = (S, A, P, R, \gamma)$ for state space $S$, action space $A$, transition distribution $P(s' \mid s, a)$, reward function $R(s, a)$, and discount factor $\gamma$. In our setting, images $o \in O$ offer only partial observability. We therefore represent the state with $k$ stacked observations to maintain the Markov property. The goal is to learn a policy $\pi(a_t \mid s_t)$, which maps states to a distribution over actions, to maximize $R = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} \gamma^t R(s_t, a_t)\right]$. In this work we focus on learning a policy that generalizes to new MDPs that have a new observation space but leave the other elements of the MDP unchanged. Policy $\pi$ is then evaluated on its return in the new MDPs, without any additional samples from the new MDPs for updating $\pi$.
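The two ingredients of this formulation, stacking $k$ observations into a state and scoring a trajectory by its discounted return, can be illustrated with a minimal sketch. The class and function names are hypothetical, and repeating the first frame at reset is an assumed convention rather than something stated in the text.

```python
from collections import deque


class FrameStack:
    """Keep the k most recent observations and expose them as one state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Assumed convention: repeat the first frame so the state always has k entries.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(obs)
        return tuple(self.frames)


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t for t = 1..T, matching the objective in the text."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))
```

For example, with `k = 3` the state after one environment step holds two copies of the initial frame followed by the newest frame.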

3.2. SOFT ACTOR-CRITIC

SAC (Haarnoja et al., 2017) is considered the state-of-the-art (SOTA) off-policy RL algorithm for most continuous control tasks. SAC trains an actor network $\pi_\psi(a_t \mid s_t)$ to take actions and a critic network $Q_\phi(s_t, a_t)$ to predict state-action values. SAC follows the maximum entropy RL principle, which is to maximize actor entropy while maximizing expected reward, $\mathbb{E}_{s_t, a_t \sim \pi}\left[\sum_t r_t + \alpha H(\pi(\cdot \mid s_t))\right]$. The critic is trained by minimizing the temporal difference error using samples $\tau_t = (s_t, a_t, s_{t+1}, r_t) \sim D$, where $D$ is a replay buffer storing the agent's previous interactions with the environment during training. The critic parameters can be updated using an approximation to the expected rewards. Let $\phi$ be the critic parameters and $\phi_{\text{target}}$ be another set of critic parameters for the bootstrapped value target.

$$L_Q(\phi) = \mathbb{E}_{\tau \sim D}\left[\left(Q_\phi(s_t, a_t) - \left(r_t + \gamma V(s_{t+1}; \phi_{\text{target}})\right)\right)^2\right] \quad (1)$$

The value of the next state is estimated through the single-step bootstrap target based on the target critic network parameters:

$$V(s_{t+1}; \phi_{\text{target}}) = \mathbb{E}_{a' \sim \pi}\left[Q_{\phi_{\text{target}}}(s_{t+1}, a') - \alpha \log \pi_\psi(a' \mid s_{t+1})\right] \quad (2)$$

Target critic parameters $\phi_{\text{target}}$ are typically updated as an exponential moving average (EMA) of the regular critic parameters $\phi$. The actor network learns a policy by minimizing the following:

$$L_\pi(\psi; \phi) = -\mathbb{E}_{a \sim \pi}\left[Q_\phi(s_t, a) - \alpha \log \pi_\psi(a \mid s_t)\right] \quad (3)$$

Typically, $\pi_\psi$ is represented by a Gaussian distribution for continuous control, updates are made via the reparameterization trick (Kingma & Welling, 2013), and $\alpha$ is learned in relation to entropy. All methods in the experiments (Section 5) use the same base neural architecture and SAC with EMA updates of the Q-target network from the critic network. A key thing to note is that the actor and critic share the same CNN encoder. This is crucial to the success of our method and of previous methods. More details are given in Appendix C.
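To make the critic update concrete, the following sketch evaluates the soft value target, the squared temporal difference error, and the EMA target update for a single transition. It uses plain floats in place of network outputs and a single sampled next action in place of the expectation; the function names and the flat parameter lists are illustrative assumptions.

```python
def soft_value(q_next, log_prob_next, alpha):
    """Single-sample estimate of V(s') = Q_target(s', a') - alpha * log pi(a' | s')."""
    return q_next - alpha * log_prob_next


def critic_loss(q_pred, reward, gamma, v_next):
    """Squared TD error for one transition (the critic loss without the expectation)."""
    target = reward + gamma * v_next
    return (q_pred - target) ** 2


def ema_update(target_params, online_params, zeta):
    """Move each target parameter a fraction zeta toward the online critic parameter."""
    return [(1.0 - zeta) * t + zeta * p for t, p in zip(target_params, online_params)]
```

With a small `zeta`, the target parameters change slowly, which is what keeps the bootstrap target stable between critic updates.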

4.1. AGENT ARCHITECTURE

The key differences between AugCL and previous works are: 1) We train a weak and a strong augmented network in parallel. 2) We train only on weak augmentation for the early phases of training. 3) We bootstrap the strong augmented network from a weakly augmented Q target network. The driving intuition is that strong augmentations incur non-zero degradation to policy learning but help with generalization. To mitigate this issue, we keep a separate network trained only on strong augmentation and update the target network using EMA from the weak augmented network. This pairing circumvents the policy degradation by allowing the weak critic and Q target network to learn a state-action value approximation without being impeded by strong augmentation. Then, by bootstrapping from this Q target network, which holds an accurate approximation of the value function, the strong augmented network learns to generalize under strong augmentation. The weak augmented pre-training is required to regularize the CNN so that it does not bias itself toward high-frequency features. We show both are necessary to achieve SOTA performance in Section 5.3. AugCL builds off of SAC, learning a critic and policy network as shown in Figure 1. The critic network $Q_\phi(s_t, a_t)$ takes as input a stacked sequence of image observations and an action, and produces the expected return for taking action $a_t$ in state $s_t$. We also learn a parameterized policy $\pi_\psi(s_t)$ that outputs a normal distribution parameterized by a learned mean and variance, from which an action for the current state is sampled. $Q_\phi$ and $\pi_\psi$ share a visual encoder $h_\mu(s_t)$, which takes as input the high-dimensional image observations and produces a low-dimensional state encoding. The parameters of $\psi$ and $\phi$ both include the encoder parameters $\mu$, which are updated as a part of both the policy and critic losses. AugCL trains $Q_\phi$ and $\pi_\psi$ through the standard SAC losses with random image augmentations.
We sample augmentations $f$ from a distribution over augmentations $F$. We then use $f$ to augment the states $s_t$ in the SAC losses from Section 3.2. We augment only the current state $s_t$ in the SAC loss, similar to Hansen et al. (2021b), since we found this improves stability, as shown in Appendix B. The resulting losses for the critic and policy with the augmentations are respectively:

$$L_Q(\phi; \phi_{\text{target}}, F) = \mathbb{E}_{\tau \sim D,\, f \sim F}\left[\left(Q_\phi(f(s_t), a_t) - \left(r_t + \gamma V(s_{t+1}; \phi_{\text{target}})\right)\right)^2\right] \quad (4)$$

$$L_\pi(\psi; \phi, F) = -\mathbb{E}_{a \sim \pi,\, f \sim F}\left[Q_\phi(f(s_t), a) - \alpha \log \pi_\psi(a \mid f(s_t))\right] \quad (5)$$

Notice that the losses are defined relative to a distribution over augmentations $F$. The choice of this augmentation distribution is an important consideration in the algorithm's stability and robustness to new MDPs. Prior work breaks up augmentations for pixel-based control into two classes: weak augmentations and strong augmentations. Weak Augmentations (denoted as $F_W$) are augmentations that help stabilize and improve training performance in the source MDP $M$, but are insufficient for generalization to new MDPs. Cetin et al. (2022) show that training with weak augmentations is important to prevent overfitting to high-frequency features in the image space when learning from bootstrapped targets in actor-critic methods. This overfitting has been shown to cause the critic network to overfit to its own predictions, and the correlation of its predictions with Monte Carlo returns to decrease as training proceeds. Training with weak augmentations is therefore an important part of any actor-critic control-from-pixels method, such as AugCL. Known weak augmentations for all DMC tasks are crop and translate (Laskin et al.) and shift (Yarats et al., 2021). Strong Augmentations (denoted as $F_S$) are augmentations that help improve policy performance in new MDPs. Visual perturbations such as random color changes of the environment can be simulated with strong augmentations such as random convolution and random color jitter.
While distracting backgrounds can be simulated with mix-up (Zhang et al., 2017), these augmentations are difficult to train with, as they increase the difficulty of the learning problem through the combination of stochastic parameter sampling for $F_S$ and high visual variance between samples. AugCL addresses how to effectively incorporate strong augmentations such as random convolution, overlay (a variation of mix-up), and our novel augmentation into training.
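As an illustration of a weak augmentation, the sketch below shifts an image by a few pixels while filling the exposed border by repeating edge pixels, which is equivalent to padding with edge replication and taking a random crop of the original size. It operates on a single-channel image stored as nested lists for readability; the pad of 4 pixels and the function name are assumptions for illustration, not values taken from this paper.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift a 2-D image by up to `pad` pixels per axis, replicating edge pixels."""
    height, width = len(image), len(image[0])
    dy = rng.randint(-pad, pad)  # randint is inclusive on both ends
    dx = rng.randint(-pad, pad)

    def clamp(index, upper):
        # Indices that fall outside the image read the nearest edge pixel instead.
        return max(0, min(index, upper))

    return [
        [image[clamp(r + dy, height - 1)][clamp(c + dx, width - 1)] for c in range(width)]
        for r in range(height)
    ]
```

The output keeps the input resolution and only contains pixel values already present in the input, which is part of why such augmentations leave the learning problem comparatively easy.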

4.2. AUGCL: CURRICULUM LEARNING WITH STRONG AUGMENTATIONS

As mentioned, strong augmentations are detrimental to learning but important to train with for generalization performance. The key idea of our method is therefore to leverage curriculum learning (Bengio et al., 2009)

Algorithm 1 (lines 9-17):
 9:     ϕ_S = ϕ_W                                      ▷ Clone critic
10: {s_i, a_i, r_i, s_{i+1} | i = 1, ..., N} ∼ B       ▷ Sample transition
11: ϕ_W ← ϕ_W − β∇_ϕ L_Q(ϕ_W; F_W, ϕ_target)           ▷ Weak Critic loss
12: if t ≥ M then
13:     ψ ← ψ − α∇_ψ L_π(ψ; F_S)                       ▷ Strong Actor loss
14:     ϕ_S ← ϕ_S − β∇_ϕ L_Q(ϕ_S; F_S, ϕ_target)       ▷ Strong Critic loss
15: else
16:     ψ ← ψ − α∇_ψ L_π(ψ; F_W)                       ▷ Weak Actor loss
17: ϕ_target ← (1 − ζ)ϕ_target + ζϕ_W                  ▷ Q-target EMA update

AugCL is described in Algorithm 1. AugCL begins by acting in the environment with π_ψ and then adding the observed transition to the replay buffer (lines 5-7). We then sample data batches from the replay buffer for updating the policy and critic. Our curriculum learning schedule breaks the updates into two phases. For the first M policy updates, AugCL is in the weak augmentation phase and updates the critic and policy from weak augmentations alone (lines 11, 16). Target critic parameters ϕ_target are updated as exponential moving averages of the learned weak critic parameters ϕ_W (line 17), and used in the bootstrap term of the critic loss. The purpose of the weak augmentation phase is to stabilize policy learning. Prior work shows that it is easy for the critic to overfit in image-based RL and that weak augmentations are important to achieve strong training performance (Cetin et al., 2022). However, the weak augmentations do not make the policy robust to new visuals. Improving generalization performance is the purpose of the subsequent strong augmentation phase. After M policy updates, AugCL switches to the strong augmentation phase and incorporates strong augmentations into training (lines 12-14). A new strong critic network ϕ_S is copied from the weak critic network ϕ_W (line 9).
Now, the weak critic network is updated as in the previous phase by training with weak augmentations. However, the separate strong augmentation network with parameters ϕ_S is now trained with strong augmentations (line 14). The policy is also updated with the strong augmentations (line 13). Separating the strong and weak augmentations into two networks is important for stability. The weak critic helps stabilize bootstrap targets by leveraging weak augmentation to better approximate the state-action value function, while the strong critic focuses on generalization performance. Previous methods have attempted this parallel training of weak and strong augmented networks, but with little success (Fan et al., 2021). We show in Section 5.3 that this is due to not regularizing the CNN encoder first. A figure of the architecture and the flow of tensors representing the input and output of each neural layer can be seen in Figure 1. The advantage of AugCL separating strong augmentations into a later phase of learning is that it does not require a delicate balance between potentially conflicting losses from strong and weak augmentations. SODA and SVEA incorporate strong augmentations as an auxiliary learning signal that is always applied in conjunction with learning an accurate approximation of the state-action value from weakly augmented data. By learning from both kinds of data at the same time, the networks must contend with the trade-off between stronger augmentations improving generalization yet harming training performance. The auxiliary objective in SODA may suffer from gradient interference between the conflicting losses, as the critic network is optimized for an accurate state-action value approximation and a contrastive loss in parallel. SVEA suffers from a similar issue in that it requires a hyperparameter to balance the combination of losses from strong and weak augmented data.
On the other hand, AugCL disentangles the responsibilities of state-action value approximation and generalization between the weak augmented critic and the strong augmented critic respectively. We empirically demonstrate this by showing AugCL performs better than SVEA and SODA on a variety of benchmarks and has minimal train environment performance degradation, as shown in Appendix D. Also note that SODA and SVEA only pass weakly augmented observations to the policy network during training. The issue with this is that $\pi_\psi(F_W(s_t)) \neq \pi_\psi(F_S(s_t))$. AugCL does not suffer from this issue as we pass $F_S(s_t)$ through the policy network at train time. Interestingly, we find train performance still improves and converges to only marginally worse performance on the train environment than training with weak augmentation alone, as shown in Appendix D. We summarize the key differences between AugCL and prior work in Appendix Table 5.
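The control flow of the two-phase schedule can be traced with a short sketch that lists, for a given update index `t`, which updates are applied and in what order. It mirrors lines 9-17 of Algorithm 1 only; the condition that triggers the clone (`t == M`) is an assumption, since the guarding line is not shown in the listing, and the function name and string labels are hypothetical.

```python
def augcl_update_plan(t, M):
    """Return the ordered updates applied at policy-update index t."""
    plan = []
    if t == M:
        # Line 9. The trigger condition is assumed; the listing only shows the clone itself.
        plan.append("clone weak critic -> strong critic")
    plan.append("weak critic loss (F_W)")        # line 11, applied in both phases
    if t >= M:
        plan.append("strong actor loss (F_S)")   # line 13
        plan.append("strong critic loss (F_S)")  # line 14
    else:
        plan.append("weak actor loss (F_W)")     # line 16
    plan.append("EMA target <- weak critic")     # line 17, always from the weak critic
    return plan
```

Note that the target network is refreshed from the weak critic in both phases, so the strong critic never feeds its own predictions back into the bootstrap target.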

4.3. SPLICE AUGMENTATION

Since AugCL is well suited to train with challenging strong augmentations, we introduce a novel augmentation called "Splice" to improve generalization in visual RL. RL generalization benchmarks that incorporate background distractions (Hansen & Wang, 2021; Stone et al., 2021) are challenging for state-of-the-art visual RL approaches. The standard solution is to introduce a variation of mixup augmentation (Zhang et al., 2017) to RL training (Hansen & Wang, 2021; Fan et al., 2021; Hansen et al., 2021b). Hansen & Wang (2021) theorized that previous strategies failed to adapt to severe background distractions because task-relevant visual features such as the agent's shadow were removed. Our new augmentation Splice addresses this issue by pasting relevant visual features into an irrelevant background. This explicit separation of task-relevant versus task-irrelevant features helps generalization. Specifically, we mask out all non-relevant parts of the visual observation through a segmentation mask which is available in the simulation. We then replace all the task-irrelevant parts of the image with a random background image. We use COCO (Lin et al., 2014) as the source of background replacement images in our experiments. Further discussion, PyTorch (Paszke et al., 2019) style code for Splice, and a visual example are provided in Appendix F.
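A minimal sketch of the masking-and-replacement idea follows, using the HSV thresholding described in Section 5.1 as the segmentation rule. It is not the implementation from Appendix F: it uses the standard-library `colorsys` module on nested lists of RGB tuples instead of batched tensors, it assumes hue is measured in radians in [0, 2π), and it assumes a pixel is kept only when hue, saturation, and value are all strictly greater than their thresholds. The function name and defaults are illustrative.

```python
import colorsys
import math


def splice(obs, background, value_thresh=0.6, hue_thresh=0.0, sat_thresh=0.0):
    """Keep pixels whose H, S, V all exceed their thresholds; replace the rest.

    obs, background: equally sized 2-D lists of (r, g, b) tuples, channels in [0, 1].
    """
    out = []
    for obs_row, bg_row in zip(obs, background):
        row = []
        for pixel, bg_pixel in zip(obs_row, bg_row):
            h, s, v = colorsys.rgb_to_hsv(*pixel)  # colorsys returns h, s, v in [0, 1]
            hue = h * 2.0 * math.pi                # assumed convention: hue in radians
            relevant = hue > hue_thresh and s > sat_thresh and v > value_thresh
            # Task-relevant pixels are copied unchanged; everything else is replaced.
            row.append(pixel if relevant else bg_pixel)
        out.append(row)
    return out
```

Unlike a blend, the kept pixels retain their original colors exactly, so the agent's appearance is unchanged while everything around it varies.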

5. EXPERIMENTS

We now evaluate AugCL and baselines on how well they can generalize to visual distribution shifts in the DMControl Generalization Benchmark (DMC-GB). In Section 5.1, we describe the experimental setup for how our method and baselines are configured. Next, in Section 5.2, we show that AugCL achieves state-of-the-art performance in the majority of settings in DMC-GB. Finally, in Section 5. 

5.1. EXPERIMENTAL SETUP

Environments and Evaluation: The purpose of our experiments is to evaluate how well policies trained with various methods can generalize to new visual disturbances. All methods are first trained in a source environment without any visual disturbances. We then evaluate the trained policy in the same environment but with random visual disturbances. DMC-GB tests how methods can generalize to random colors, backgrounds, and camera poses. All methods are trained for 500,000 frames and evaluated on 5 tasks from DMC-GB in three different evaluation settings (color hard, video easy, and video hard). The 5 tasks from DMC-GB used in this paper are described in Appendix Table 8. We report the mean and standard deviation across 5 seeds per method, where each seed is evaluated by taking the average episode return across 30 episodes. For Table 2 and Table 3 we added training with SVEA during the weak augmentation phase.

Baselines: We compare AugCL against other recent pixel-based RL methods, some of which were explicitly designed for learning robust policies that can generalize to unseen environments. Specifically, we compare against CURL, RAD, SVEA, SODA, and DrQ, as well as PAD (Hansen et al., 2021a), which adapts to the test environment using self-supervision. Hyperparameters and baseline results are taken from Hansen et al. (2021b). We do not compare to SECANT as it requires double the training frames of all other baselines and requires training 2 models sequentially. Also note that SVEA and SODA augment each batch twice, thus doubling the data the agent trains on, whereas AugCL only uses a single batch.

Data Augmentation Setup: We apply random shift (Yarats et al., 2021) as our weak augmentation for AugCL. For all experiments, we selected M = 200,000 for AugCL, which we determined empirically, meaning we first perform 200k updates in the weak augmentation phase before switching to the strong augmentation phase. All hyperparameters shared between AugCL and baselines are kept the same. Random convolution produced the best results in prior works on color hard, and overlay on the video DMC-GB benchmarks (Hansen et al., 2021b), and we therefore use those augmentations for their respective benchmarks. Note that DrQ, AugCL and SVEA all use shift as their weak augmentation, while CURL, PAD, RAD and SODA use crop. This is important to note as shift has been shown to give stronger empirical results than crop in DMC tasks (Yarats et al., 2021). "Overlay" in Tables 2 and 3 refers to the Hansen & Wang (2021) version of mix-up. The original SVEA and SODA papers use the Places dataset (Zhou et al., 2018a) for Overlay, but at the time this paper was written Places was unavailable due to maintenance, so we instead used COCO for AugCL. We felt this was a fair comparison as long as both datasets were different from RealEstate10k (Zhou et al., 2018b), which is used by DMC-GB.

We also include results using Splice on the DMC-GB video easy and video hard benchmarks. In the DMControl tasks, we segment out the agent by filtering: converting the RGB image to HSV, and then taking only pixels with HSV values greater than a threshold. We set a consistent value threshold of 0.6 for all tasks to remove all aspects of the image, including the shadow, leaving only the agent. The hue threshold was 0 for all tasks except "Cartpole, Swingup", which required the hue threshold to be set to 3.5 due to the background being a mix of lighter and darker blues in this environment. The saturation threshold was set to 0 for all tasks. The full list of hyperparameters can be found in Appendix Table 4.

Table 3: Results from DMC-GB benchmark video hard. AugCL with the splice augmentation outperforms baselines in all of the tasks.
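The reporting protocol from the start of this section (a per-seed score equal to the average return over evaluation episodes, then mean and standard deviation across seeds) can be written down in a few lines. This is a sketch of that aggregation only; the function name is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not say which is reported.

```python
from statistics import mean, stdev


def summarize(returns_per_seed):
    """returns_per_seed: one list of episode returns per seed."""
    # Each seed is scored by its average episode return.
    seed_scores = [mean(episodes) for episodes in returns_per_seed]
    # Assumed: sample standard deviation across seeds.
    return mean(seed_scores), stdev(seed_scores)
```

In the setup described above, `returns_per_seed` would hold 5 lists of 30 episode returns each.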

5.2. DMC-GB RESULTS

Firstly, Table 1 shows that AugCL outperforms all baselines in 4 out of 5 tasks in DMC-GB color hard environments. This further closes the gap between the performance of the policy in training and its generalization performance on the test environment. "Ball In Cup, Catch" and "Finger, Spin" under color hard are close to matching the current SOTA in the train environment thanks to SVEA and AugCL, as shown in Table 7. Despite SODA and SVEA using the same augmentations as AugCL and also being designed for generalization in pixel-based RL, AugCL outperforms them in evaluation return. Next, in the DMC-GB video easy and video hard environments, AugCL again outperforms baselines in almost all of the settings. AugCL outperforms baselines in 4 out of 5 tasks in video easy (Table 2) and in 5 out of 5 tasks in video hard (Table 3). A combination of the splice augmentation and AugCL performs best. Splice combined with AugCL does well on "Finger, Spin" under video easy, as it is only 1 average episodic reward off from the train environment SOTA, as seen in Table 7. We theorize that Splice performs better than Overlay because Overlay is a weighted sum of pixels from the state image and an irrelevant image. Laskin et al. and Cetin et al. (2022) showed that the utility of weak augmentation is that a regularized CNN improves spatial attention mapping to task-relevant features. Overlay may impede this process by making task-relevant features less visible.
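The weighted-sum behaviour of Overlay can be seen in a small sketch on single-channel pixel values. The blend weight of 0.5 and the function name are assumptions for illustration; the point is only that blending shrinks the contrast between agent pixels and their surroundings.

```python
def overlay(obs, distractor, alpha=0.5):
    """Blend each pixel value: (1 - alpha) * obs + alpha * distractor."""
    return [
        [(1.0 - alpha) * o + alpha * d for o, d in zip(obs_row, dis_row)]
        for obs_row, dis_row in zip(obs, distractor)
    ]
```

In a contrived case where a bright agent pixel (1.0) sits next to a dark floor pixel (0.2), blending with distractor values 0.0 and 0.8 maps both to roughly 0.5, so the agent is no longer distinguishable by brightness, whereas a splice-style replacement would leave the agent pixel at 1.0.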

5.3. ABLATIONS

We analyze two design choices of AugCL: the curriculum and using separate critic networks for weak and strong augmentations. We use Random Convolution as the strong augmentation for all variations of AugCL in this section. No Pretrain in Figure 2 represents AugCL without the copying of weights to the strongly augmented network at step M (omitting line 9 in Alg. 1). Single Critic represents having a single critic network for both strong and weak augmentation (omitting line 11, with line 17 becoming ϕ_target ← (1 − ζ)ϕ_target + ζϕ_S in Alg. 1). Without the pre-training, even while bootstrapping from a Q-target network updated using a weakly augmented network, No Pretrain has much higher variance across seeds and is not able to match AugCL's performance on the Walker, Walk task. This shows the importance of first regularizing the CNN on weak augmentation, which motivates the curriculum. While Single Critic performs much better than No Pretrain, it is much less sample efficient and converges to a lower return than AugCL on both the train environment and the test environment. Single Critic does improve its policy, but it does not perform as well as delegating the learning of the state-action value to the weakly augmented network, which shows the non-zero destabilization that strong augmentation incurs. We also experimented with setting M = 0 and found the agent was unable to learn, as the strong critic could not learn a useful representation and could not bootstrap from the target network's predictions to improve learning. This was indicated by the episodic reward not improving as training continued. We believe this also points towards the importance of weak augmentation regularization early in training. We include further exploration of selecting M in Appendix A.
We believe that these experiments are clear evidence that strong augmentations do indeed incur a non-zero degradation to policy learning and that the key to getting good generalization performance is to disentangle the strong and weak augmentation as AugCL does, which is not possible without the curriculum as an unregularized CNN has difficulty learning task relevant representations.

6. CONCLUSION

AugCL shows substantial improvements in generalization to unseen environments by disentangling strong and weak augmentations into their respective networks. The combination of AugCL and Splice has substantially improved performance on DMC-GB, giving a new SOTA. We also show the importance of weak augmented pre-training for parallel weak and strong augmented network training, highlighting the ingredient missing from previous attempts. A limitation of our method is that selecting the optimal M remains an open question. M = 100,000 yielded much worse results and we theorized that the CNN was not regularized enough. This tricky balance can lead to significant changes in results and we hope to find a more developed method for selecting M.

A CHOICE OF M ON PERFORMANCE

We explored how different selections of M affect AugCL in Figure 3. On the test environment, we found a roughly parabolic relationship between M and average episodic reward, with performance peaking at M = 100k or 200k. While 100k and 200k seemed to give the same performance, higher and lower values of M had much lower average performance. We theorize this is due to striking a good balance between regularizing the CNN with weak augmentation and then training it to adapt to the strongly augmented version of the environment, which requires a lot of frames to approximate.

B DRQ VS RAD VS NON-NAIVE AUGMENTATION

In this paper we also explored the RAD style of augmentation (Laskin et al.) versus DrQ (Yarats et al., 2021) and Non-Naive Augmentation (Hansen et al., 2021b). To get the best generalization results possible, we needed the best results possible on the train environment, as that acts as an upper bound on what can be achieved during generalization. We theorized that the main strengths of DrQ stem from the shift augmentation it introduced, and that the multiple sampling serves to mitigate the issues introduced by RAD style augmentation. We show in Figure 5 that Non-Naive Augmentation does indeed perform better than or equal to DrQ and RAD style augmentation in terms of average episodic reward on the train environment at the end of 500,000 train steps, and is more sample efficient on 4 out of 5 tasks. RAD is equivalent to "DrQ 1k1m" in Figure 5. This is an exciting development as sampling multiple augmentations of images can be costly, especially with off-policy methods, which are known to take multiple days to train. Though more exploration is required, as we only ran this assessment on 5 tasks in DMC, we encourage practitioners of pixel-based RL to leverage Non-Naive Augmentation over the current alternatives.

Figure 3: [..., 100k, 200k, 300k, 400k, 500k]. AugCL is trained with random convolution as the strong augmentation on Walker, Walk for a total of 500k frames. Blue circles represent performance of different seeds using the average episodic reward over 50 runs and the red x represents the mean across all 3 seeds.

C NEURAL ARCHITECTURE

In this section we go into more detail about our neural architecture, which is adopted from [Hansen et al. (2021b)]. A Convolutional Neural Network (CNN) encoder is shared between the actor and critic networks. The encoder is only updated with respect to Equation (1) and is detached during backpropagation through Equation (3). The encoder is an 11-layer CNN that takes a stack of RGB frames, each rendered at 84 × 84 × 3, and outputs features of size 32 × 21 × 21, where 32 is the number of channels and 21 × 21 is the spatial dimension of the feature map. All convolutional layers use 32 filters and 3 × 3 kernels. The first convolutional layer has a stride of 2, while all others have a stride of 1. The shared encoder is followed by separate linear projection layers for the actor and the critic, each consisting of three fully connected layers with a hidden dimension of 1024. All hyperparameters are shown in Table 4.
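The stated 21 × 21 feature map can be checked with the standard convolution output-size formula. The sketch below assumes the convolutions use no padding, which the text does not state but which is consistent with the reported sizes; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of one convolutional layer."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_size(input_size=84, num_layers=11):
    """Apply one stride-2 layer, then (num_layers - 1) stride-1 layers."""
    size = conv_out(input_size, stride=2)
    for _ in range(num_layers - 1):
        size = conv_out(size, stride=1)
    return size


side = encoder_output_size()
flat_features = 32 * side * side  # 32 channels, flattened for the linear layers
print(side, flat_features)
```

The first layer maps 84 to 41, and each of the remaining ten unpadded 3 × 3 layers removes two pixels, giving 41 - 20 = 21.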

D AUGCL TRAIN ENVIRONMENT PERFORMANCE

We believe a key aspect of generalization is maintaining the best possible policy on the train environment as well. Table 6 reports the train environment performance of AugCL across the different strong augmentations we benchmarked. Table 7 reports results for non-naive shift on the train environment, which serve as an upper bound on what all the generalized methods can achieve.

E TASK DESCRIPTIONS

Table 8 gives, for each task, a brief description, the dimension of the action vector, and whether the reward is dense or sparse.

F MORE DETAILS ON SPLICE AUGMENTATION

The inspiration for Splice came when we noticed that in many robotics tasks the relevant visual features have higher brightness. DMC fits this criterion well, as the ground and background tend to be dark blue, while the agent is a combination of bright colors (typically yellow and a bright blue). Splice converts an RGB image to HSV color space and then applies a threshold to each of hue, saturation and value. We use kornia [Riba et al. (2020)] for color space conversion. With suitably tuned thresholds, what remains of the frames is the task-relevant features. Example code is given below and a comparative example is shown in Figure 6.



Figure 1: Neural architecture and tensor flow across different phases for AugCL.

Figure 2: No Pretrain: No weak augmented pre-training on the strong network. Single Critic: A single critic is used for training under weak and strong augmentation. The line represents the mean over 3 seeds, and the shadow represents variance. Figure 2a shows performance on the train environment, and 2b shows performance during training on DMC-GB color hard. The lines represent averages and the shaded regions the standard deviation of the results across 5 seeds.

Figure 3: We show the effects of M selection on AugCL by choosing M from [0, 100k, 200k, 300k, 400k, 500k]. AugCL is trained with random convolution as the strong augmentation on Walker, Walk for a total of 500k frames. Blue circles represent the performance of different seeds using the average episodic reward over 50 runs, and the red x represents the mean across all 3 seeds.

Figure 4: Image taken from [Hansen et al. (2021b)] with examples from cartpole and walker of DMC-GB.

Figure 5: We compare DrQ and Non-Naive Augmentation on the train environment. DrQ is run with k, m = 1 and k, m = 2, where k is the number of augmented samples for s_t in Equation (1) and m is the number of augmented samples for s_{t+1}; all methods use shift. We find that non-naive augmentation with shift is more sample efficient and has lower performance variance between seeds on 4 out of 5 tasks. The mean and standard deviation of the average of 30 episodes across 3 seeds (0 to 2) are used to generate the plots. The lines represent the means and the shadow represents the variance.

import torch
import kornia


def splice(x, hue_t, saturation_t, value_t):
    # x: batch of RGB images, shape (b, 3, h, w), values in [0, 1].
    b, _, h, w = x.shape
    # kornia returns hue in [0, 2*pi] and saturation/value in [0, 1].
    x_hsv = kornia.color.rgb_to_hsv(x)
    # sample_background is defined elsewhere; it is expected to return
    # b background images with the same shape, dtype and device as x.
    overlay = sample_background(batch_size=b)
    thresholds = torch.tensor(
        [hue_t, saturation_t, value_t], dtype=x.dtype, device=x.device
    ).view(1, -1, 1, 1)
    # Keep a pixel only if hue, saturation and value all exceed their thresholds.
    mask = torch.all(x_hsv > thresholds, dim=1, keepdim=True)
    # Paste the kept pixels onto the background; the mask broadcasts over channels.
    return torch.where(mask, x, overlay)
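The thresholding rule in the Splice listing can also be illustrated on single pixels using only the Python standard library. This is a sketch under stated assumptions rather than the authors' code: `splice_pixel` is a hypothetical helper, the threshold values are made up for illustration, and `colorsys` reports hue in [0, 1] whereas kornia reports it in [0, 2π].

```python
import colorsys


def splice_pixel(rgb, background_rgb, hue_t, saturation_t, value_t):
    """Keep rgb if its hue, saturation and value all exceed the
    thresholds; otherwise return the background pixel."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    keep = h > hue_t and s > saturation_t and v > value_t
    return rgb if keep else background_rgb


# Illustrative thresholds (not taken from the paper).
thresholds = dict(hue_t=0.0, saturation_t=0.5, value_t=0.5)
new_background = (0.5, 0.5, 0.5)

bright_yellow = (1.0, 0.9, 0.1)  # stands in for an agent pixel
dark_blue = (0.05, 0.1, 0.3)     # stands in for a ground/background pixel

print(splice_pixel(bright_yellow, new_background, **thresholds))
print(splice_pixel(dark_blue, new_background, **thresholds))
```

With these values the bright pixel passes all three thresholds and is kept, while the dark pixel fails the value threshold and is replaced by the new background.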

to avoid this destabilization. Specifically, AugCL defines a curriculum over augmentations to enable better training and generalization. It is well known that CNNs are inherently biased to high-frequency features (Geirhos et al., 2018; Jo & Bengio, 2017); the consequence of this in RL is a lower average episodic reward in the train environment. AugCL uses weak augmentations early in training to regularize the CNN. Later in training, AugCL introduces strong augmentations to improve the robustness of the policy. We train two separate networks in parallel, as we believe that strong augmentation incurs a non-zero degradation to policy learning, as shown in Section 5.3.

3, we analyze what hyperparameters are necessary for the benefits of AugCL.

Results from the DMC-GB color hard benchmark. All methods are evaluated on 5 seeds over 30 episodes. The mean and standard deviation are provided. AugCL outperforms the baseline in 4 out of the 5 tasks.

Results from the DMC-GB video easy generalization benchmark. AugCL with the new Splice augmentation outperforms prior work in 4 out of 5 of the tasks.

Table 4: Hyperparameters used for all experiments.

This table summarizes what we assessed as key differences and similarities between methods. "N/A" stands for "Not Applicable."

In HSV color space, hue represents color, saturation represents chromatic intensity and value represents brightness. If all values in a cell of the HSV-converted image exceed the preset thresholds, that cell is copied onto a new image. In our case we splice out the agent and paste it onto a randomized background. Laskin et al.; Cetin et al. (2022) show that weak augmentation leads to a better spatial attention mapping of features the agent can control, like the robot, over high-frequency features like the background and flooring. Splice has the ability to impart human prior knowledge about the tasks through tuning the thresholds. By tuning the thresholds accordingly, the user can parse out only the relevant visual features.

Table 6: Train environment performance at the end of 500,000 train steps of AugCL with varying strong augmentations. The mean and standard deviation are taken over 5 seeds, where each seed is evaluated using the mean of 30 episodes.

Walker, Walk | Walker, Stand | Ball in Cup, Catch | Cartpole Swingup | Finger, Spin

Table 7: Shift augmentation evaluation results on the train environment across 5 seeds, with the mean and standard deviation across 30 episodes for each seed. This table serves as an upper bound on what can be achieved in generalized benchmarks.

Swing up and balance an unactuated pole by applying forces to a cart at its base. The agent is rewarded for balancing the pole within a fixed threshold angle.

DoF finger. The task is to continually spin a free body 2 No

Table 8: Action space dimension, brief description of each task, and whether rewards are dense. Descriptions are taken from Hansen et al. (2021b).

