IMAGE AUGMENTATION IS ALL YOU NEED: REGULARIZING DEEP REINFORCEMENT LEARNING FROM PIXELS

Abstract

Existing model-free reinforcement learning (RL) approaches are effective when trained on states but struggle to learn directly from image observations. We propose an augmentation technique that can be applied to standard model-free RL algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, as well as regularizing the value function and policy. Our approach reaches a new state-of-the-art performance on the DeepMind control suite and the Atari 100k benchmark, surpassing previous model-free and model-based approaches.

1. INTRODUCTION

Sample-efficient deep reinforcement learning (RL) algorithms capable of training directly from image pixels would open up many real-world applications in control and robotics. However, simultaneously training a convolutional encoder alongside a policy network is challenging given limited environment interaction, strong correlation between samples, and a typically sparse reward signal. Limited supervision is a common problem across AI, and two approaches are commonly taken: (i) training with additional auxiliary losses, such as those based on self-supervised learning (SSL), and (ii) training with data augmentation.

A wide range of auxiliary loss functions have been proposed to augment supervised objectives, e.g. weight regularization, noise injection (Hinton et al., 2012), or various forms of auto-encoder (Kingma et al., 2014). In RL, reconstruction losses (Jaderberg et al., 2017; Yarats et al., 2019) or SSL objectives (Dwibedi et al., 2018; Srinivas et al., 2020) are used. However, these objectives are unrelated to the task at hand and thus have no guarantee of inducing a representation appropriate for the policy network. SSL losses are highly effective in the large-data regime, e.g. in domains such as vision (Chen et al., 2020; He et al., 2019) and NLP (Collobert et al., 2011; Devlin et al., 2018) where large (unlabeled) datasets are readily available. However, in sample-efficient RL, training data is more limited due to restricted interaction between the agent and the environment, limiting their effectiveness.

Data augmentation methods are widely used in vision and speech domains, where output-invariant perturbations can easily be applied to the labeled input examples. Surprisingly, data augmentation has received little attention in the RL community. In this paper we propose augmentation approaches appropriate for sample-efficient RL and comprehensively evaluate them.
The key idea of our approach is to use standard image transformations to perturb input observations, as well as to regularize the Q-function learned by the critic so that different transformations of the same input image have similar Q-values. No further modifications to standard actor-critic algorithms are required. Our study is, to the best of our knowledge, the first careful examination of image augmentation in sample-efficient RL. The main contributions of the paper are as follows: (i) we are the first to demonstrate that data augmentation greatly improves performance when training model-free RL algorithms from images; (ii) we introduce a natural way to exploit MDP structure through two mechanisms for regularizing the value function, in a manner that is generally applicable to model-free RL; and (iii) we set a new state-of-the-art performance on the standard DeepMind control suite (Tassa et al., 2018), closing the gap with learning from states, and on the Atari 100k benchmark (Kaiser et al., 2019).

2. RELATED WORK

Data Augmentation in Computer Vision Data augmentation via image transformations has been used to improve generalization since the inception of convolutional networks (Becker & Hinton, 1992; Simard et al., 2003; LeCun et al., 1989; Ciresan et al., 2011; Ciregan et al., 2012). Following AlexNet (Krizhevsky et al., 2012), such transformations have become a standard part of training pipelines. For object classification tasks, the transformations are selected to avoid changing the semantic category, i.e. translations, scales, color shifts, etc. While a similar set of transformations is potentially applicable to control tasks, the RL context does require modifications to the underlying algorithm. Data augmentation methods have also been used in the context of self-supervised learning. Dosovitskiy et al. (2016) use per-exemplar perturbations in an unsupervised classification framework. More recently, several approaches (Chen et al., 2020; He et al., 2019; Misra & van der Maaten, 2019) have used invariance to imposed image transformations in contrastive learning schemes, producing state-of-the-art results on downstream recognition tasks. By contrast, our scheme addresses control tasks, utilizing different types of invariance.

Data Augmentation in RL In contrast to computer vision, data augmentation is rarely used in RL. Certain approaches adopt it implicitly; for example, Levine et al. (2018) and Kalashnikov et al. (2018) use image augmentation as part of the AlexNet training pipeline, but without analyzing the benefits accruing from it, so the technique was overlooked in subsequent work. HER (Andrychowicz et al., 2017) exploits information about the observation space by goal and reward relabeling, which can be viewed as a way to perform data augmentation. Other work uses data augmentation to improve generalization in domain transfer (Cobbe et al., 2018).
However, the classical image transformations used in vision have not previously been shown to definitively help on standard RL benchmarks. Concurrent with our work, RAD (Laskin et al., 2020) performs an exploration of different data augmentation approaches, but is limited to transformations of the image alone, without the additional augmentation of the Q-function used in our approach. Moreover, RAD can be regarded as a special case of our algorithm. Multiple follow-ups to our initial preprint have appeared on arXiv (Raileanu et al., 2020; Okada & Taniguchi, 2020), using similar techniques on other tasks, thus supporting the effectiveness and generality of data augmentation in RL.

Continuous Control from Pixels

There are a variety of methods addressing the sample-efficiency of RL algorithms that directly learn from pixels. The most prominent approaches for this can be classified into two groups, model-based and model-free methods. The model-based methods attempt to learn the system dynamics in order to acquire a compact latent representation of high-dimensional observations to later perform policy search (Hafner et al., 2018; Lee et al., 2019; Hafner et al., 2019) . In contrast, the model-free methods either learn the latent representation indirectly by optimizing the RL objective (Barth-Maron et al., 2018; Abdolmaleki et al., 2018) or by employing auxiliary losses that provide additional supervision (Yarats et al., 2019; Srinivas et al., 2020; Sermanet et al., 2018; Dwibedi et al., 2018) . Our approach is complementary to these methods and can be combined with them to improve performance.

3. BACKGROUND

Reinforcement Learning from Images We formulate image-based control as an infinite-horizon partially observable Markov decision process (POMDP) (Bellman, 1957; Kaelbling et al., 1998). A POMDP can be described as the tuple $(\mathcal{O}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{O}$ is the high-dimensional observation space (image pixels), $\mathcal{A}$ is the action space, the transition dynamics $p = \Pr(o_{t+1} \mid o_{\leq t}, a_t)$ capture the probability distribution over the next observation $o_{t+1}$ given the history of previous observations $o_{\leq t}$ and current action $a_t$, $r : \mathcal{O} \times \mathcal{A} \to \mathbb{R}$ is the reward function that maps the current observation and action to a reward $r_t = r(o_{\leq t}, a_t)$, and $\gamma \in [0, 1)$ is a discount factor. Per common practice (Mnih et al., 2013), throughout the paper the POMDP is converted into an MDP (Bellman, 1957) by stacking several consecutive image observations into a state $s_t = \{o_t, o_{t-1}, o_{t-2}, \ldots\}$. For simplicity we redefine the transition dynamics as $p = \Pr(s_{t+1} \mid s_t, a_t)$ and the reward function as $r_t = r(s_t, a_t)$. We then aim to find a policy $\pi(a_t \mid s_t)$ that maximizes the cumulative discounted return $\mathbb{E}_\pi[\sum_{t=1}^{\infty} \gamma^t r_t \mid a_t \sim \pi(\cdot \mid s_t), s_{t+1} \sim p(\cdot \mid s_t, a_t), s_1 \sim p(\cdot)]$.

Soft Actor-Critic The Soft Actor-Critic (SAC) (Haarnoja et al., 2018) learns a state-action value function $Q_\theta$, a stochastic policy $\pi_\theta$ and a temperature $\alpha$ to find an optimal policy for an MDP $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ by optimizing a $\gamma$-discounted maximum-entropy objective (Ziebart et al., 2008). $\theta$ is used generically to denote the parameters updated through training in each part of the model.

Deep Q-learning DQN (Mnih et al., 2013) also learns a convolutional neural network to approximate the Q-function over states and actions. The main difference is that DQN operates on discrete action spaces, thus the policy can be directly inferred from Q-values.
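The POMDP-to-MDP conversion via frame stacking described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the stack size of 3 matches the observation setup described in Appendix B.4:

```python
import numpy as np
from collections import deque

def make_frame_stacker(k=3):
    """Return a function mapping each new observation o_t to the
    stacked state s_t = {o_t, o_{t-1}, ..., o_{t-k+1}}, concatenated
    along the channel dimension. The first observation of an episode
    is repeated k times to bootstrap the stack."""
    frames = deque(maxlen=k)

    def push(obs):
        if not frames:  # start of episode: fill the stack with o_1
            for _ in range(k):
                frames.append(obs)
        else:
            frames.append(obs)
        # oldest frame first, newest last
        return np.concatenate(list(frames), axis=0)

    return push
```

For 3-channel 84 × 84 frames, a stack of 3 yields 9 × 84 × 84 states, which is the input shape the encoder consumes.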
In practice, the standard version of DQN is frequently combined with a set of refinements that improve performance and training stability, commonly known as Rainbow (van Hasselt et al., 2015) . For simplicity, the rest of the paper describes a generic actor-critic algorithm rather than DQN or SAC in particular. Further background on DQN and SAC can be found in Appendix A.

4.1. OPTIMALITY INVARIANT IMAGE TRANSFORMATIONS FOR Q FUNCTION

We first introduce a general framework for regularizing the value function through transformations of the input state. For a given task, we define an optimality invariant state transformation $f : \mathcal{S} \times \mathcal{T} \to \mathcal{S}$ as a mapping that preserves the Q-values:

$Q(s, a) = Q(f(s, \nu), a) \quad \text{for all } s \in \mathcal{S},\; a \in \mathcal{A} \text{ and } \nu \in \mathcal{T},$

where $\nu$ are the parameters of $f(\cdot)$, drawn from the set of all possible parameters $\mathcal{T}$. One example of such transformations are the random image translations successfully applied in the previous section.

For every state, the transformations allow the generation of several surrogate states with the same Q-values, thus providing a mechanism to reduce the variance of Q-function estimation. In particular, for an arbitrary distribution of states $\mu(\cdot)$ and policy $\pi$, instead of using a single sample $s^* \sim \mu(\cdot)$, $a^* \sim \pi(\cdot \mid s^*)$ to estimate the expectation

$\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi(\cdot \mid s)}[Q(s, a)] \approx Q(s^*, a^*),$

we generate $K$ samples via random transformations and obtain an estimate with lower variance:

$\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi(\cdot \mid s)}[Q(s, a)] \approx \frac{1}{K} \sum_{k=1}^{K} Q(f(s^*, \nu_k), a_k),$

where $\nu_k \in \mathcal{T}$ and $a_k \sim \pi(\cdot \mid f(s^*, \nu_k))$. This suggests two distinct ways to regularize the Q-function. First, we use data augmentation to compute the target values for every transition tuple $(s_i, a_i, r_i, s'_i)$ as

$y_i = r_i + \gamma \frac{1}{K} \sum_{k=1}^{K} Q_{\theta'}(f(s'_i, \nu'_{i,k}), a'_{i,k}), \quad \text{where } a'_{i,k} \sim \pi(\cdot \mid f(s'_i, \nu'_{i,k})) \quad (1)$

and $\nu'_{i,k} \in \mathcal{T}$ corresponds to a transformation parameter of $s'_i$. The Q-function is then updated using these targets through an SGD update with learning rate $\lambda_\theta$:

$\theta \leftarrow \theta - \lambda_\theta \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \left( Q_\theta(f(s_i, \nu_i), a_i) - y_i \right)^2. \quad (2)$

In tandem, we note that the same target from Equation (1) can be used for different augmentations of $s_i$, resulting in the second regularization approach:

$\theta \leftarrow \theta - \lambda_\theta \nabla_\theta \frac{1}{NM} \sum_{i=1}^{N} \sum_{m=1}^{M} \left( Q_\theta(f(s_i, \nu_{i,m}), a_i) - y_i \right)^2. \quad (3)$

When both regularization methods are used, $\nu_{i,m}$ and $\nu'_{i,k}$ are drawn independently.
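The two regularizers can be sketched end-to-end with toy stand-ins for the real networks. In this minimal NumPy sketch, `q`, `q_target`, `pi`, and `f` are placeholder functions of our own invention (the real critic, target critic, policy, and image transformation are neural networks and random shifts, respectively); only the structure of the averaged target (Equation 1) and the averaged loss (Equation 3) is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, nu):
    """Placeholder optimality invariant transformation (a no-op here,
    so invariance holds by construction)."""
    return s + 0.0 * nu

def q(s, a):
    """Toy critic: maps a batch of states (n, d) and actions (n,) to values (n,)."""
    return (s.sum(axis=-1) + a) * 0.1

def q_target(s, a):
    """Toy target critic (in practice an EMA copy of q)."""
    return (s.sum(axis=-1) + a) * 0.1

def pi(s):
    """Toy stochastic policy: one action per state in the batch."""
    return rng.normal(size=s.shape[0])

def drq_critic_loss(s, a, r, s_next, gamma=0.99, K=2, M=2):
    """DrQ critic loss: target averaged over K augmentations of s'
    (Eq. 1) and squared error averaged over M independent
    augmentations of s (Eq. 3)."""
    n = s.shape[0]
    # target averaged over K random augmentations of the next state
    y = np.zeros(n)
    for _ in range(K):
        nu = rng.uniform(-4, 4)
        s_aug = f(s_next, nu)
        y += q_target(s_aug, pi(s_aug)) / K
    y = r + gamma * y
    # loss averaged over M independent augmentations of the current state
    loss = 0.0
    for _ in range(M):
        nu = rng.uniform(-4, 4)
        loss += np.mean((q(f(s, nu), a) - y) ** 2) / M
    return loss
```

Setting K = M = 1 recovers plain image augmentation, matching the DrQ [K=1, M=1] ablation in Section 5.1.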

4.2. PRACTICAL INSTANTIATION OF OPTIMALITY INVARIANT IMAGE TRANSFORMATION

A range of successful image augmentation techniques have been developed in computer vision (Ciregan et al., 2012; Ciresan et al., 2011; Simard et al., 2003; Krizhevsky et al., 2012; Chen et al., 2020). These apply transformations to the input image for which the task labels are invariant, e.g. for object recognition tasks, image flips and rotations do not alter the semantic label. However, tasks in RL differ significantly from those in vision, and in many cases the reward would not be preserved by these transformations. We examine image transformations from Chen et al. (2020) (random shifts, random cutouts, horizontal/vertical flips, rotations and intensity shifts) in Appendix E and conclude that random shifts strike a good balance between simplicity and performance; we therefore limit our choice of transformation function f(·) to random shifts. We apply shifts to the images sampled from the replay buffer. For example, images from the DeepMind control suite used in our experiments are 84 × 84. We pad each side by 4 pixels (by repeating boundary pixels) and then select a random 84 × 84 crop, yielding the original image shifted by ±4 pixels. This procedure is repeated every time an image is sampled from the replay buffer.
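The pad-then-crop procedure above can be sketched in a few lines. This is a NumPy illustration of the described transformation, not the paper's implementation (which operates on PyTorch tensors):

```python
import numpy as np

def random_shift(imgs, pad=4, rng=None):
    """Random-shift augmentation from Section 4.2: pad each spatial
    side by `pad` pixels by replicating boundary pixels, then take a
    random crop of the original size. A crop offset in [0, 2*pad]
    corresponds to a shift of up to +-pad pixels.

    imgs: array of shape (N, C, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, c, h, w = imgs.shape
    # replicate boundary pixels ("edge" padding) on spatial dims only
    padded = np.pad(imgs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```

Each image in the mini-batch receives an independent shift, and the augmentation is re-applied every time an image is sampled from the replay buffer.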

4.3. OUR APPROACH: DATA-REGULARIZED Q (DRQ)

Our approach, DrQ, is the union of the three separate regularization mechanisms introduced above:
1. transformations of the input image (Section 4.2);
2. averaging the Q target over K image transformations (Equation (1));
3. averaging the Q function itself over M image transformations (Equation (3)).
Algorithm 1 details how they are incorporated into a generic pixel-based off-policy actor-critic algorithm. Note that if [K=1, M=1] then DrQ reverts to image transformations alone; this makes applying DrQ to any model-free RL algorithm straightforward. For the experiments in this paper, we pair DrQ with SAC (Haarnoja et al., 2018) and DQN (Mnih et al., 2013), popular model-free algorithms for control in continuous and discrete action spaces respectively. We select image shifts as the class of image transformations f, with shifts of ±4 pixels, as explained in Section 4.2.

5.1. ABLATION EXPERIMENT

Figure 1 shows the effect of image shift augmentation applied to three tasks from the DeepMind control suite (Tassa et al., 2018). Figure 1a shows unmodified SAC (Haarnoja et al., 2018) parameterized with different image encoders, taken from: NatureDQN (Mnih et al., 2013), Dreamer (Hafner et al., 2019), Impala (Espeholt et al., 2018), SAC-AE (Yarats et al., 2019), and D4PG (Barth-Maron et al., 2018). The encoders vary significantly in their architecture and capacity, with parameter counts ranging from 220k to 2.4M. None of these train satisfactorily, with performance decreasing for the larger capacity models.

Algorithm 1 DrQ: Data-regularized Q applied to a generic off-policy actor-critic algorithm. Black: unmodified off-policy actor-critic. Orange: image transformation. Green: target Q augmentation. Blue: Q augmentation.
Hyperparameters: total number of environment steps $T$, mini-batch size $N$, learning rate $\lambda_\theta$, target network update rate $\tau$, image transformation $f$, number of target Q augmentations $K$, number of Q augmentations $M$.

    for each timestep $t = 1..T$ do
        $a_t \sim \pi(\cdot \mid s_t)$
        $s'_t \sim p(\cdot \mid s_t, a_t)$
        $\mathcal{D} \leftarrow \mathcal{D} \cup (s_t, a_t, r(s_t, a_t), s'_t)$                     // add a transition to the replay buffer
        UPDATECRITIC($\mathcal{D}$)
        UPDATEACTOR($\mathcal{D}$)                                                                 // data augmentation is applied to the samples for actor training as well
    end for

    procedure UPDATECRITIC($\mathcal{D}$)
        $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N} \sim \mathcal{D}$                                     // sample a mini-batch from the replay buffer
        $\nu'_{i,k} \sim \mathcal{U}(\mathcal{T})$, $i = 1..N$, $k = 1..K$                         // uniformly sample target augmentations
        for each $i = 1..N$ do
            $a'_i \sim \pi(\cdot \mid s'_i)$  or  $a'_{i,k} \sim \pi(\cdot \mid f(s'_i, \nu'_{i,k}))$, $k = 1..K$
            $\hat{Q}_i = Q_{\theta'}(s'_i, a'_i)$  or  $\hat{Q}_i = \frac{1}{K} \sum_{k=1}^{K} Q_{\theta'}(f(s'_i, \nu'_{i,k}), a'_{i,k})$
            $y_i \leftarrow r(s_i, a_i) + \gamma \hat{Q}_i$
        end for
        $\nu_{i,m} \sim \mathcal{U}(\mathcal{T})$, $i = 1..N$, $m = 1..M$                          // uniformly sample Q augmentations
        $J_Q(\theta) = \frac{1}{N} \sum_{i=1}^{N} (Q_\theta(s_i, a_i) - y_i)^2$  or  $J_Q(\theta) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{m=1}^{M} (Q_\theta(f(s_i, \nu_{i,m}), a_i) - y_i)^2$
        $\theta \leftarrow \theta - \lambda_\theta \nabla_\theta J_Q(\theta)$                      // update the critic
        $\theta' \leftarrow (1 - \tau)\theta' + \tau \theta$                                       // update the critic target
    end procedure
Figure 1b shows SAC with the application of our random shifts transformation of the input images (i.e. just Section 4.2, not Q augmentation also). The results for all encoder architectures improve dramatically, suggesting that our method is general and can assist many different encoder architectures. To the best of our knowledge, this is the first successful demonstration of applying image augmentation on the standard benchmarks for continuous control. Furthermore, Figure 2 shows the full DrQ, with both image shifts and Q augmentation (Section 4.1), as well as ablated versions. Q augmentation provides additional consistent gain over image shift augmentation alone (full results are in Appendix F).

5.2. DEEPMIND CONTROL SUITE EXPERIMENTS

In this section we evaluate our algorithm (DrQ) on the two commonly used benchmarks based on the DeepMind control suite (Tassa et al., 2018), namely the PlaNet (Hafner et al., 2018) and Dreamer (Hafner et al., 2019) setups. Throughout these experiments all hyper-parameters of the algorithm are kept fixed: the actor and critic neural networks are trained using the Adam optimizer (Kingma & Ba, 2014) with default parameters and a mini-batch size of 512 (see footnote 0). For SAC, the soft target update rate τ is 0.01, the initial temperature is 0.1, and the target network and actor updates are made every 2 critic updates (as in Yarats et al. (2019)). We use the image encoder architecture from SAC-AE (Yarats et al., 2019) and follow their training procedure. The full set of parameters can be found in Appendix B.

Following Henderson et al. (2018), the models are trained using 10 different seeds; for every seed the mean episode returns are computed every 10000 environment steps, averaging over 10 episodes. All figures plot the mean performance over the 10 seeds, together with ±1 standard deviation shading. We compare our DrQ approach to leading model-free and model-based approaches: PlaNet (Hafner et al., 2018), SAC-AE (Yarats et al., 2019), SLAC (Lee et al., 2019), CURL (Srinivas et al., 2020) and Dreamer (Hafner et al., 2019). The comparisons use the results provided by the authors of the corresponding papers.

PlaNet Benchmark (Hafner et al., 2018) consists of six challenging control tasks from (Tassa et al., 2018) with different traits. The benchmark specifies a different action-repeat hyper-parameter for each of the six tasks (see footnote 1).
Following common practice (Hafner et al., 2018; Lee et al., 2019; Mnih et al., 2013), we report performance in terms of true environment steps, and are thus invariant to the action-repeat hyper-parameter. Aside from action-repeat, all other hyper-parameters of our algorithm are fixed across the six tasks, using the values previously detailed. Figure 3 compares DrQ [K=2, M=2] to PlaNet (Hafner et al., 2018), SAC-AE (Yarats et al., 2019), CURL (Srinivas et al., 2020), SLAC (Lee et al., 2019), and an upper-bound performance provided by SAC (Haarnoja et al., 2018) that learns directly from internal states. We use the version of SLAC that performs one gradient update per environment step to ensure a fair comparison to other approaches. DrQ achieves state-of-the-art performance on all tasks in this benchmark, despite being much simpler than the other methods. Furthermore, since DrQ does not learn a model (Hafner et al., 2018; Lee et al., 2019) or any auxiliary tasks (Srinivas et al., 2020), its wall-clock time also compares favorably to the other methods. In Table 1 we also compare performance at a fixed number of environment interactions (100k and 500k). Furthermore, in Appendix G we demonstrate that DrQ is robust to significant changes in hyper-parameter settings.

Dreamer Benchmark is a more extensive testbed that was introduced in Dreamer (Hafner et al., 2019), featuring a diverse set of tasks from the DeepMind control suite. Tasks involving sparse reward were excluded (e.g. Acrobot and Quadruped) since they require modification of SAC to incorporate multi-step returns (Barth-Maron et al., 2018), which is beyond the scope of this work. We evaluate on the remaining 15 tasks, fixing the action-repeat hyper-parameter to 2 as in Hafner et al. (2019). We compare DrQ [K=2, M=2] to Dreamer (Hafner et al., 2019) and the upper-bound performance of SAC (Haarnoja et al., 2018) from states (see footnote 2).
Again, we keep all the hyper-parameters of our algorithm fixed across all the tasks. In Figure 4, DrQ demonstrates state-of-the-art results by collectively outperforming Dreamer (Hafner et al., 2019), although Dreamer is superior on 3 of the 15 tasks (Walker Run, Cartpole Swingup Sparse and Pendulum Swingup). On many tasks DrQ approaches the upper-bound performance of SAC (Haarnoja et al., 2018) trained directly on states.

Figure 4: The Dreamer benchmark. Our method (DrQ [K=2, M=2]) again demonstrates superior performance over Dreamer on 12 out of 15 selected tasks. In many cases it also reaches the upper-bound performance of SAC learning directly from states.

5.3. ATARI 100K EXPERIMENTS

We evaluate DrQ [K=1, M=1] on the Atari 100k benchmark (Kaiser et al., 2019), a sample-constrained evaluation for discrete control algorithms. The underlying RL approach to which DrQ is applied is DQN, combined with double Q-learning (van Hasselt et al., 2015), n-step returns (Mnih et al., 2016), and a dueling critic architecture (Wang et al., 2015). As per common practice (Kaiser et al., 2019; van Hasselt et al., 2019a), we evaluate our agent for 125k environment steps at the end of training and average its performance over 5 random seeds. Figure 5 shows the median human-normalized episode return (as in Mnih et al. (2013)) of the underlying model, which we refer to as Efficient DQN, in pink. When DrQ is added there is a significant increase in performance (cyan), surpassing OTRainbow (Kielak, 2020) and Data Efficient Rainbow (van Hasselt et al., 2019a). DrQ is also superior to CURL (Srinivas et al., 2020).


Figure 5: The Atari 100k benchmark. Compared to a set of leading baselines, our method (DrQ [K=1, M=1], combined with Efficient DQN) achieves state-of-the-art performance, despite being considerably simpler. Note the large improvement that results from adding DrQ to Efficient DQN (pink vs cyan). By contrast, the gains from CURL, which utilizes tricks from both Data Efficient Rainbow and OTRainbow, are more modest over the underlying RL methods.

6. CONCLUSION

We have introduced a regularization technique, based on image shifts and Q-function augmentation, that significantly improves the performance of model-free RL algorithms trained directly from images. In contrast to the concurrent work of Laskin et al. (2020) , which is a special case of DrQ, our method exploits the MDP structure of the problem, demonstrating gains over image augmentations alone. Our method is easy to implement and adds a negligible computational burden. We compared our method to state-of-the-art approaches on the DeepMind control suite, outperforming them on the majority of tasks and closing the gap with state-based training. On the Atari 100k benchmark DrQ outperforms other SOTA methods in the median metric. To the best of our knowledge, this is the first convincing demonstration of the utility of data augmentation on these standard benchmarks. Furthermore, we demonstrate the method to be robust to the choice of hyper-parameters.

APPENDIX A EXTENDED BACKGROUND

Reinforcement Learning from Images We formulate image-based control as an infinite-horizon partially observable Markov decision process (POMDP) (Bellman, 1957; Kaelbling et al., 1998). A POMDP can be described as the tuple $(\mathcal{O}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{O}$ is the high-dimensional observation space (image pixels), $\mathcal{A}$ is the action space, the transition dynamics $p = \Pr(o_{t+1} \mid o_{\leq t}, a_t)$ capture the probability distribution over the next observation $o_{t+1}$ given the history of previous observations $o_{\leq t}$ and current action $a_t$, $r : \mathcal{O} \times \mathcal{A} \to \mathbb{R}$ is the reward function that maps the current observation and action to a reward $r_t = r(o_{\leq t}, a_t)$, and $\gamma \in [0, 1)$ is a discount factor. Per common practice (Mnih et al., 2013), throughout the paper the POMDP is converted into an MDP (Bellman, 1957) by stacking several consecutive image observations into a state $s_t = \{o_t, o_{t-1}, o_{t-2}, \ldots\}$. For simplicity we redefine the transition dynamics as $p = \Pr(s_{t+1} \mid s_t, a_t)$ and the reward function as $r_t = r(s_t, a_t)$. We then aim to find a policy $\pi(a_t \mid s_t)$ that maximizes the cumulative discounted return $\mathbb{E}_\pi[\sum_{t=1}^{\infty} \gamma^t r_t \mid a_t \sim \pi(\cdot \mid s_t), s_{t+1} \sim p(\cdot \mid s_t, a_t), s_1 \sim p(\cdot)]$.

Soft Actor-Critic The Soft Actor-Critic (SAC) (Haarnoja et al., 2018) learns a state-action value function $Q_\theta$, a stochastic policy $\pi_\theta$ and a temperature $\alpha$ to find an optimal policy for an MDP $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ by optimizing a $\gamma$-discounted maximum-entropy objective (Ziebart et al., 2008). $\theta$ is used generically to denote the parameters updated through training in each part of the model. The actor policy $\pi_\theta(a_t \mid s_t)$ is a parametric tanh-Gaussian that, given $s_t$, samples $a_t = \tanh(\mu_\theta(s_t) + \sigma_\theta(s_t)\,\epsilon)$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $\mu_\theta$ and $\sigma_\theta$ are parametric mean and standard deviation.
The policy evaluation step learns the critic $Q_\theta(s_t, a_t)$ network by optimizing a single step of the soft Bellman residual

$J_Q(\mathcal{D}) = \mathbb{E}_{(s_t, a_t, s'_t) \sim \mathcal{D},\, a'_t \sim \pi(\cdot \mid s'_t)}[(Q_\theta(s_t, a_t) - y_t)^2],$
$y_t = r(s_t, a_t) + \gamma [Q_{\theta'}(s'_t, a'_t) - \alpha \log \pi_\theta(a'_t \mid s'_t)],$

where $\mathcal{D}$ is a replay buffer of transitions and $\theta'$ is an exponential moving average of the weights, as done in Lillicrap et al. (2015). SAC uses clipped double-Q learning (van Hasselt et al., 2015; Fujimoto et al., 2018), which we omit from our notation for simplicity but employ in practice. The policy improvement step then fits the actor policy $\pi_\theta(a_t \mid s_t)$ network by optimizing the objective

$J_\pi(\mathcal{D}) = \mathbb{E}_{s_t \sim \mathcal{D}}[D_{\mathrm{KL}}(\pi_\theta(\cdot \mid s_t) \,\|\, \exp\{\tfrac{1}{\alpha} Q_\theta(s_t, \cdot)\})].$

Finally, the temperature $\alpha$ is learned with the loss

$J_\alpha(\mathcal{D}) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\theta(\cdot \mid s_t)}[-\alpha \log \pi_\theta(a_t \mid s_t) - \alpha \bar{\mathcal{H}}],$

where $\bar{\mathcal{H}} \in \mathbb{R}$ is the target entropy hyper-parameter that the policy tries to match, which in practice is usually set to $\bar{\mathcal{H}} = -|\mathcal{A}|$.

Deep Q-learning DQN (Mnih et al., 2013) also learns a convolutional neural network to approximate the Q-function over states and actions. The main difference is that DQN operates on discrete action spaces, thus the policy can be directly inferred from Q-values. The parameters of DQN are updated by optimizing the squared residual error

$J_Q(\mathcal{D}) = \mathbb{E}_{(s_t, a_t, s'_t) \sim \mathcal{D}}[(Q_\theta(s_t, a_t) - y_t)^2],$
$y_t = r(s_t, a_t) + \gamma \max_{a'} Q_{\theta'}(s'_t, a').$

In practice, the standard version of DQN is frequently combined with a set of tricks that improve performance and training stability, widely known as Rainbow (van Hasselt et al., 2015).
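The two target computations above differ only in how the next-state value is formed: SAC subtracts the entropy term, while DQN maximizes over discrete actions. A minimal NumPy sketch of just the target formulas (the network evaluations `q_next`, `log_pi_next`, and `q_next_all` are assumed to be given):

```python
import numpy as np

def sac_target(r, q_next, log_pi_next, alpha=0.1, gamma=0.99):
    """Soft Bellman target:
    y_t = r + gamma * (Q'(s', a') - alpha * log pi(a'|s'))."""
    return r + gamma * (q_next - alpha * log_pi_next)

def dqn_target(r, q_next_all, gamma=0.99):
    """DQN target: y_t = r + gamma * max_a' Q'(s', a'),
    where q_next_all holds the Q-values of every discrete action."""
    return r + gamma * np.max(q_next_all, axis=-1)
```

Note that with α = 0 the SAC target reduces to the ordinary (expected) Bellman target, which is why the entropy term can be viewed as a bonus added to the reward.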

B THE DEEPMIND CONTROL SUITE EXPERIMENTS SETUP

Our PyTorch SAC (Haarnoja et al., 2018) implementation is based on Yarats & Kostrikov (2020).

B.1 ACTOR AND CRITIC NETWORKS

We employ clipped double Q-learning (van Hasselt et al., 2015; Fujimoto et al., 2018) for the critic, where each Q-function is parametrized as a 3-layer MLP with ReLU activations after each layer except the last. The actor is also a 3-layer MLP with ReLUs that outputs the mean and covariance of the diagonal Gaussian that represents the policy. The hidden dimension is set to 1024 for both the critic and the actor.

B.2 ENCODER NETWORK

We employ the encoder architecture from Yarats et al. (2019). This encoder consists of four convolutional layers with 3 × 3 kernels and 32 channels. The ReLU activation is applied after each conv layer. We use stride 1 everywhere except the first conv layer, which has stride 2. The output of the convnet is fed into a single fully-connected layer normalized by LayerNorm (Ba et al., 2016). Finally, we apply a tanh nonlinearity to the 50-dimensional output of the fully-connected layer. We initialize the weight matrices of the fully-connected and convolutional layers with the orthogonal initialization (Saxe et al., 2013) and set the biases to zero. The actor and critic networks both have separate encoders, although we share the weights of the conv layers between them. Furthermore, only the critic optimizer is allowed to update these weights (i.e. we stop the gradients from the actor before they propagate to the shared conv layers).
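The spatial dimensions implied by this architecture can be verified with a small helper. This sketch only checks the arithmetic of the stated strides and kernels (assuming no padding, which the text does not specify): a 84 × 84 input passes through one stride-2 and three stride-1 3 × 3 convs:

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output spatial size of a conv layer (no dilation):
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_output_shape(in_size=84, channels=32):
    """Spatial sizes through the 4-layer encoder described above:
    stride 2 on the first 3x3 conv, stride 1 on the remaining three."""
    s = conv_out(in_size, stride=2)   # 84 -> 41
    for _ in range(3):
        s = conv_out(s, stride=1)     # 41 -> 39 -> 37 -> 35
    return channels, s, s
```

Under these assumptions the convnet output is 32 × 35 × 35, i.e. a 39200-dimensional vector is flattened into the fully-connected layer.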

B.3 TRAINING AND EVALUATION SETUP

Our agent first collects 1000 seed observations using a random policy. Further training observations are collected by sampling actions from the current policy. We perform one training update every time we receive a new observation. In cases where we use action repeat, the number of training observations is only a fraction of the environment steps (e.g. a 1000-step episode at action repeat 4 results in only 250 training observations). We evaluate our agent every 10000 true environment steps by computing the average episode return over 10 evaluation episodes. During evaluation we take the mean policy action instead of sampling.

B.4 PLANET AND DREAMER BENCHMARKS

We consider two evaluation setups that were introduced in PlaNet (Hafner et al., 2018) and Dreamer (Hafner et al., 2019), both using tasks from the DeepMind control suite (Tassa et al., 2018). The PlaNet benchmark consists of six tasks of various traits. Importantly, the benchmark proposes a different action-repeat hyper-parameter for each task, which we summarize in Table 2. The Dreamer benchmark considers an extended set of tasks, which makes it more difficult than the PlaNet setup. Additionally, this benchmark requires using the same set of hyper-parameters for each task, including action repeat (set to 2), which further increases the difficulty. We construct an observational input as a 3-stack of consecutive frames (Mnih et al., 2013), where each frame is an RGB rendering of size 84 × 84 from the 0th camera. We then divide each pixel by 255 to scale it down to the [0, 1] range.

B.6 OTHER HYPER PARAMETERS

Due to computational constraints, for all the continuous control ablation experiments in the main paper and appendix we use a mini-batch size of 128, while for the main results we use a mini-batch size of 512. In Table 3 we provide a comprehensive overview of all the other hyper-parameters.

C THE ATARI 100K EXPERIMENTS SETUP

For ease of reproducibility, in Table 4 we report the hyper-parameter settings used in the Atari 100k experiments. We largely reuse the hyper-parameters from OTRainbow (Kielak, 2020), but adapt them for DQN (Mnih et al., 2013). As per common practice, we average the performance of our agent over 5 random seeds. Evaluation is performed for 125k environment steps at the end of training, which itself lasts 100k environment steps.

D FULL ATARI 100K RESULTS

Besides reporting in Figure 5 the median human-normalized episode returns over the 26 Atari games used in (Kaiser et al., 2019), we also provide the mean episode return for each individual game in Table 5. Table 5: Mean episode returns on each of 26 Atari games from the setup in Kaiser et al. (2019). The results are recorded at the end of training and averaged across 5 random seeds (CURL's results are averaged over 3 seeds, as reported in Srinivas et al. (2020)). On each game we mark in bold the highest score. Our method demonstrates better overall performance (as reported in Figure 5).

E IMAGE AUGMENTATIONS ABLATION

Following Chen et al. (2020), we evaluate popular image augmentation techniques, namely random shifts, cutouts, vertical and horizontal flips, random rotations and imagewise intensity jittering. Below, we provide a comprehensive overview of each augmentation. Furthermore, we examine the effectiveness of these techniques in Figure 6. Random Shift We first consider random shifts, which are commonly used to regularize neural networks trained on small images (Becker & Hinton, 1992; Simard et al., 2003; LeCun et al., 1989; Ciresan et al., 2011; Ciregan et al., 2012). In our implementation of this method, images of size 84 × 84 are padded on each side by 4 pixels (by repeating boundary pixels) and then randomly cropped back to the original 84 × 84 size. Cutout Cutouts, introduced in DeVries & Taylor (2017), represent a generalization of Dropout (Hinton et al., 2012). Instead of masking individual pixels, cutouts mask square regions. Since image pixels can be highly correlated, this technique has been shown to improve the training of neural networks. Horizontal/Vertical Flip This technique simply flips an image either horizontally or vertically with probability 0.1.

Rotate

Here, an image is rotated by r degrees, where r is uniformly sampled from [-5, 5].

Intensity Each N × C × 84 × 84 image tensor is multiplied by a single scalar s, computed as s = µ + σ · clip(r, -2, 2), where r ∼ N(0, 1). For our experiments we use µ = 1.0 and σ = 0.1.

Figure 6: Different image augmentations have different effects on the agent's performance. Overall, we conclude that using image augmentations helps to fight overfitting. Moreover, we observe that random shifts prove to be the most effective technique for tasks from the DeepMind control suite.

Implementation Finally, we provide a Python-like implementation of the aforementioned augmentations powered by Kornia (Riba et al., 2020).
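As an illustrative, standalone alternative to the Kornia-based implementation, the following NumPy sketches show three of these augmentations; the function names and per-image loops here are ours, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shift(imgs, pad=4):
    """Pad each side by `pad` pixels (repeating boundary pixels), then
    randomly crop back to the original size. imgs: (N, C, H, W)."""
    n, c, h, w = imgs.shape
    padded = np.pad(imgs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        top, left = rng.integers(0, 2 * pad + 1, size=2)
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def cutout(imgs, size=16):
    """Mask a random `size` x `size` square region of each image with zeros."""
    out = imgs.copy()
    n, c, h, w = imgs.shape
    for i in range(n):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        out[i, :, top:top + size, left:left + size] = 0.0
    return out

def intensity(imgs, mu=1.0, sigma=0.1):
    """Multiply each image by s = mu + sigma * clip(r, -2, 2), r ~ N(0, 1)."""
    r = rng.standard_normal((imgs.shape[0], 1, 1, 1))
    return imgs * (mu + sigma * np.clip(r, -2.0, 2.0))

batch = np.ones((4, 3, 84, 84), dtype=np.float32)
assert random_shift(batch).shape == batch.shape
assert np.allclose(intensity(batch, sigma=0.0), batch)  # sigma = 0 is the identity
```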



Note that DrQ does not utilize additional information beyond transitions sampled from the replay buffer (i.e., it does not use more observations per mini-batch), so the mini-batch size is the same as for unmodified SAC. The number of training observations is thus a fraction of the environment steps (e.g., an episode of 1000 steps with action repeat 4 yields 250 training observations). No other publicly reported results are available for the other methods due to the recency of the Dreamer (Hafner et al., 2019) benchmark.



Figure 1: The performance of SAC trained from pixels on the DeepMind control suite using image encoder networks of different capacity (network architectures taken from recent RL algorithms, with parameter count indicated). (a): unmodified SAC. Task performance can be seen to get worse as the capacity of the encoder increases. For Walker Walk (right), all architectures provide mediocre performance, demonstrating the inability of SAC to train directly from pixels on harder problems. (b): SAC combined with image augmentation in the form of random shifts. The task performance is now similar for all architectures, regardless of their capacity, which suggests the generality of our method. There is also a clear performance improvement relative to (a), particularly for the more challenging Walker Walk task.

Figure 2: Different combinations of our three regularization techniques on tasks from (Tassa et al., 2018) using SAC. Black: standard SAC. Blue: DrQ [K=1,M=1], SAC augmented with random shifts. Red: DrQ [K=2,M=1], random shifts + Target Q augmentations. Purple: DrQ [K=2,M=2], random shifts + Target Q + Q augmentations. All three regularization methods correspond to Algorithm 1 with different K,M showing clear gains when both Target Q and Q augmentations are used.

import torch
import torch.nn as nn

class Intensity(nn.Module):
    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        # One scalar multiplier per image: s = 1 + scale * clip(r, -2, 2), r ~ N(0, 1).
        r = torch.randn((x.size(0), 1, 1, 1), device=x.device)
        noise = 1.0 + (self.scale * r.clamp(-2.0, 2.0))
        return x * noise

Figure 3: The PlaNet benchmark. Our algorithm (DrQ [K=2,M=2]) outperforms the other methods and demonstrates state-of-the-art performance. Furthermore, on several tasks DrQ is able to match the upper-bound performance of SAC trained directly on internal state, rather than images. Finally, our algorithm not only shows improved sample-efficiency relative to other approaches, but is also faster in terms of wall-clock time.

The PlaNet benchmark at 100k and 500k environment steps. Our method (DrQ [K=2,M=2]) outperforms other approaches in both the data-efficient (100k) and asymptotic-performance (500k) regimes. The random-shifts-only version (DrQ [K=1,M=1]) is competitive, but consistently inferior to DrQ [K=2,M=2], particularly at 100k steps. We emphasize that both versions of DrQ use exactly the same number of interactions with both the environment and the replay buffer. Note that DrQ [K=1,M=1] is almost identical to RAD (Laskin et al., 2020), modulo some hyper-parameter differences.

The action repeat hyper-parameter used for each task in the PlaNet benchmark.

An overview of used hyper-parameters in the DeepMind control suite experiments.

A complete overview of hyper-parameters used in the Atari 100k experiments.

F K AND M HYPER-PARAMETERS ABLATION

We further ablate the K and M hyper-parameters from Algorithm 1 to understand their effect on performance. In Figure 7 we observe that increasing the values of K and M improves the agent's performance. We choose the [K=2,M=2] parametrization as it strikes a good balance between performance and computational demands.
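As a reminder of what K and M control, the following simplified NumPy sketch shows the two averaging steps of Algorithm 1: the TD target is averaged over K augmentations of the next observation, and the Q-function loss over M augmentations of the current observation. The Q-function and augmentation below are toy stand-ins, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(obs):
    # Stand-in for random shift: additive noise, for illustration only.
    return obs + 0.01 * rng.standard_normal(obs.shape)

def q(obs):
    # Stand-in Q-value: a fixed function of the (flattened) observation.
    return obs.sum(axis=-1)

def drq_target_and_loss(obs, next_obs, reward, gamma=0.99, K=2, M=2):
    # Target Q augmentation: average the TD target over K augmented next observations.
    target = reward + gamma * np.mean([q(augment(next_obs)) for _ in range(K)], axis=0)
    # Q augmentation: average the squared TD error over M augmented current observations.
    loss = np.mean([(q(augment(obs)) - target) ** 2 for _ in range(M)])
    return target, loss

obs = rng.standard_normal((8, 4))
next_obs = rng.standard_normal((8, 4))
reward = rng.standard_normal(8)
target, loss = drq_target_and_loss(obs, next_obs, reward)
```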


G ROBUSTNESS INVESTIGATION

To demonstrate the robustness of our approach (Henderson et al., 2018), we perform a comprehensive study of the effect different hyper-parameter choices have on performance. A review of prior work (Hafner et al., 2018; 2019; Lee et al., 2019; Srinivas et al., 2020) shows consistent values for the discount γ = 0.99 and target update rate τ = 0.01, but variability in network architectures, mini-batch sizes, and learning rates. Since our method is based on SAC (Haarnoja et al., 2018), we also check whether the initial value of the temperature is important, as it plays a crucial role in the initial phase of exploration. We omit a search over network architectures since Figure 1b shows our method to be robust to the exact choice. We thus focus on three hyper-parameters: mini-batch size, learning rate, and initial temperature. Due to computational demands, experiments are restricted to a subset of tasks from (Tassa et al., 2018).

Figure 8: A robustness study of our algorithm (DrQ) to changes in mini-batch size, learning rate, and initial temperature hyper-parameters on three different tasks from (Tassa et al., 2018). Each row corresponds to a different mini-batch size. The low variance of the curves and heat-maps shows DrQ to be generally robust to exact hyper-parameter settings.

Published as a conference paper at ICLR 2021

Figure 8 shows performance curves for each configuration, as well as a heat map over the mean performance of the final evaluation episodes, similar to Mnih et al. (2016). Our method demonstrates good stability and is largely invariant to the studied hyper-parameters. We emphasize that, for simplicity, the experiments in Section 5 use the default learning rate of Adam (Kingma & Ba, 2014) (0.001), even though it is not always optimal.

H IMPROVED DATA-EFFICIENT REINFORCEMENT LEARNING FROM PIXELS

Our data augmentation strategy allows us to generate many different transformations from a single training observation.
Thus, we further investigate whether performing more training updates per environment step can lead to even better sample-efficiency. Following van Hasselt et al. (2019b), we compare a single update with a mini-batch of 512 transitions against 4 updates with 4 different mini-batches of 128 transitions each. Without data augmentation, performing more updates per environment step leads to even worse overfitting on some tasks (see Figure 9a), while our method DrQ, which takes advantage of data augmentation, demonstrates improved sample-efficiency (see Figure 9b).
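The two update schedules compared above can be sketched as follows; the replay buffer, sampling, and update step are hypothetical stand-ins, shown only to make the transition budget explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
replay = rng.standard_normal((10_000, 4))  # stand-in replay buffer of transitions

def sample(batch_size):
    idx = rng.integers(0, len(replay), size=batch_size)
    return replay[idx]

def update(batch):
    # Stand-in for one gradient update; returns the number of transitions consumed.
    return len(batch)

# Schedule A: one update with a mini-batch of 512 transitions per environment step.
consumed_a = update(sample(512))

# Schedule B: four updates with four mini-batches of 128 transitions each.
consumed_b = sum(update(sample(128)) for _ in range(4))

# Both schedules consume the same number of transitions per environment step;
# they differ only in how many gradient updates those transitions fund.
assert consumed_a == consumed_b == 512
```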

