DATA-EFFICIENT REINFORCEMENT LEARNING WITH SELF-PREDICTIVE REPRESENTATIONS

Abstract

While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, a 55% relative improvement over the previous state of the art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. The code associated with this work is available at https://github.com/mila-iqia/spr.

1. INTRODUCTION

Deep Reinforcement Learning (deep RL, François-Lavet et al., 2018) has proven to be an indispensable tool for training successful agents on difficult sequential decision-making problems (Bellemare et al., 2013; Tassa et al., 2018). The success of deep RL is particularly noteworthy in highly complex, strategic games such as StarCraft (Vinyals et al., 2019) and DoTA2 (OpenAI et al., 2019), where deep RL agents now surpass expert human performance in some scenarios. Deep RL involves training agents based on large neural networks using large amounts of data (Sutton, 2019), a trend evident across both model-based (Schrittwieser et al., 2020) and model-free (Badia et al., 2020) learning. The sample complexity of such state-of-the-art agents is often incredibly high: MuZero (Schrittwieser et al., 2020) and Agent-57 (Badia et al., 2020) use 10 to 50 years of experience per Atari game, and OpenAI Five (OpenAI et al., 2019) uses 45,000 years of experience to achieve its remarkable performance. This is clearly impractical: unlike easily-simulated environments such as video games, collecting interaction data for many real-world tasks is costly, making improved data efficiency a prerequisite for the successful use of deep RL in these settings (Dulac-Arnold et al., 2019).
Meanwhile, new self-supervised representation learning methods have significantly improved data efficiency when learning new vision and language tasks, particularly in low-data regimes or semi-supervised learning (Xie et al., 2019; Hénaff et al., 2019; Chen et al., 2020b). Self-supervised methods improve data efficiency by leveraging a nearly limitless supply of training signal from tasks generated on-the-fly, based on "views" drawn from the natural structure of the data (e.g., image patches, data augmentation or temporal proximity; see Doersch et al., 2015; Oord et al., 2018; Hjelm et al., 2019; Tian et al., 2019; Bachman et al., 2019; He et al., 2020; Chen et al., 2020a). Motivated by successes in semi-supervised and self-supervised learning (Tarvainen & Valpola, 2017; Xie et al., 2019; Grill et al., 2020), we train better state representations for RL by forcing representations to be temporally predictive and consistent when subject to data augmentation. Specifically, we extend a strong model-free agent by adding a dynamics model which predicts future latent representations provided by a parameter-wise exponential moving average of the agent itself. We also add data augmentation to the future prediction task, which enforces consistency across different views of each observation. Unlike some prior methods (Kaiser et al., 2019; Hafner et al., 2019), our dynamics model operates entirely in the latent space and does not rely on reconstructing raw states. We evaluate our method, which we call Self-Predictive Representations (SPR), on the 26 games in the Atari 100k benchmark (Kaiser et al., 2019), where agents are allowed only 100k steps of environment interaction (producing 400k frames of input) per game, which roughly corresponds to two hours of real-time experience. Notably, the human experts in Mnih et al. (2015) and Van Hasselt et al. (2016) were given the same amount of time to learn these games, so a budget of 100k steps permits a reasonable comparison in terms of data efficiency. In our experiments, we augment a modified version of Data-Efficient Rainbow (DER) (van Hasselt et al., 2019) with the SPR loss, and evaluate versions of SPR with and without data augmentation. We find that each version is superior to controlled baselines. When coupled with data augmentation, SPR achieves a median score of 0.415, a state-of-the-art result on this benchmark that outperforms prior methods by a significant margin. Notably, SPR also outperforms human expert scores on 7 out of 26 games while using roughly the same amount of in-game experience.

2. METHOD

We consider reinforcement learning (RL) in the standard Markov Decision Process (MDP) setting, where an agent interacts with its environment in episodes, each consisting of a sequence of observations, actions and rewards. We use s_t, a_t and r_t to denote the state, the action taken by the agent and the reward received at timestep t. We seek to train an agent that maximizes its expected cumulative reward in each episode. To do this, we combine a strong model-free RL algorithm, Rainbow (Hessel et al., 2018), with Self-Predictive Representations as an auxiliary loss to improve sample efficiency. We now describe our overall approach in detail.

2.1. DEEP Q-LEARNING

We focus on the Arcade Learning Environment (Bellemare et al., 2013), a challenging setting where the agent takes discrete actions while receiving purely visual, pixel-based observations. A prominent method for solving Atari, Deep Q-Networks (Mnih et al., 2015), trains a neural network Q_θ to approximate the agent's current Q-function (policy evaluation) while updating the agent's policy greedily with respect to this Q-function (policy improvement). This involves minimizing the error between predictions from Q_θ and a target value estimated by Q_ξ, an earlier version of the network:

L^DQN_θ = (Q_θ(s_t, a_t) − (r_t + γ max_{a'} Q_ξ(s_{t+1}, a')))².   (1)

Various improvements have been made over the original DQN: Distributional RL (Bellemare et al., 2017) models the full distribution of future reward rather than just its mean, Dueling DQN (Wang et al., 2016) decouples the value of a state from the advantage of taking a given action in that state, and Double DQN (Van Hasselt et al., 2016) modifies the Q-learning update to avoid overestimation due to the max operation, among many others. Rainbow (Hessel et al., 2018) consolidates these improvements into a single combined algorithm, and has been adapted to work well in data-limited regimes (van Hasselt et al., 2019).
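As a concrete illustration, the squared TD error of Eq. (1) can be sketched in a few lines. This is a minimal NumPy sketch under assumed tabular inputs; `dqn_loss` and the array shapes are illustrative names, not the authors' code, and Rainbow actually replaces this squared error with a distributional loss.

```python
import numpy as np

def dqn_loss(q_online, q_target, actions, rewards, gamma=0.99):
    """Squared TD error of Eq. (1), averaged over a batch.

    q_online: (B, A) array of Q_theta(s_t, .)
    q_target: (B, A) array of Q_xi(s_{t+1}, .) from an earlier copy of the network
    actions, rewards: (B,) arrays of a_t and r_t
    """
    chosen = q_online[np.arange(len(actions)), actions]   # Q_theta(s_t, a_t)
    td_target = rewards + gamma * q_target.max(axis=1)    # r_t + gamma * max_a' Q_xi(s_{t+1}, a')
    return float(np.mean((chosen - td_target) ** 2))
```

In practice the TD target is treated as a constant, so no gradient flows through Q_ξ.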

2.2. SELF-PREDICTIVE REPRESENTATIONS

For our auxiliary loss, we start with the intuition that encouraging state representations to be predictive of future states, given future actions, should improve the data efficiency of RL algorithms. Let (s_{t:t+K}, a_{t:t+K}) denote a sequence of K + 1 previously experienced states and actions sampled from a replay buffer, where K is the maximum number of steps into the future which we want to predict. Our method has four main components, which we describe below:

• Online and target networks: We use an online encoder f_o to transform observed states s_t into representations z_t ≜ f_o(s_t). We use these representations in an objective that encourages them to be predictive of future observations up to some fixed temporal offset K, given a sequence of K actions to perform. When using data augmentation, we augment each observation s_t independently. Rather than predicting representations produced by the online encoder, we follow prior work (Tarvainen & Valpola, 2017; Grill et al., 2020) by computing target representations for future states using a target encoder f_m, whose parameters are an exponential moving average (EMA) of the online encoder parameters. Denoting the parameters of f_o as θ_o, those of f_m as θ_m, and the EMA coefficient as τ ∈ [0, 1), the update rule for θ_m is:

θ_m ← τ θ_m + (1 − τ) θ_o.   (2)

The target encoder is not updated via gradient descent. The special case τ = 0, where θ_m = θ_o, is noteworthy, as it performs well when regularization is already provided by data augmentation.

• Transition model: For the prediction objective, we generate a sequence of K predictions ẑ_{t+1:t+K} of the future state representations z̃_{t+1:t+K} using an action-conditioned transition model h. We compute ẑ_{t+1:t+K} iteratively: ẑ_{t+k+1} ≜ h(ẑ_{t+k}, a_{t+k}), starting from ẑ_t ≜ z_t ≜ f_o(s_t). We compute z̃_{t+1:t+K} by applying the target encoder f_m to the observed future states s_{t+1:t+K}: z̃_{t+k} ≜ f_m(s_{t+k}).
The transition model and prediction loss operate in the latent space, thus avoiding pixel-based reconstruction objectives. We describe the architecture of h in Section 2.3.

• Projection heads: We use online and target projection heads g_o and g_m (Chen et al., 2020a) to project online and target representations to a smaller latent space, and apply an additional prediction head q (Grill et al., 2020) to the online projections to predict the target projections:

ŷ_{t+k} ≜ q(g_o(ẑ_{t+k})), ∀ẑ_{t+k} ∈ ẑ_{t+1:t+K};   ỹ_{t+k} ≜ g_m(z̃_{t+k}), ∀z̃_{t+k} ∈ z̃_{t+1:t+K}.   (3)

The target projection head parameters are given by an EMA of the online projection head parameters, using the same update as for the online and target encoders.

• Prediction loss: We compute the future prediction loss for SPR by summing the cosine similarities¹ between the predicted and observed representations at timesteps t + k for 1 ≤ k ≤ K:

L^SPR_θ(s_{t:t+K}, a_{t:t+K}) = − Σ_{k=1}^{K} (ỹ_{t+k} / ||ỹ_{t+k}||_2) · (ŷ_{t+k} / ||ŷ_{t+k}||_2),

where ỹ_{t+k} and ŷ_{t+k} are computed from (s_{t:t+K}, a_{t:t+K}) as just described.

We call our method Self-Predictive Representations (SPR), reflecting the predictive nature of the objective and the use of an exponential moving average target network, similar to Tarvainen & Valpola (2017) and He et al. (2020). During training, we combine the SPR loss with the Q-learning loss for Rainbow. The SPR loss affects f_o, g_o, q and h; the Q-learning loss affects f_o and the Q-learning head, which contains additional layers specific to Rainbow. Denoting the Q-learning loss from Rainbow as L^RL_θ, our full optimization objective is:

L^total_θ = L^RL_θ + λ L^SPR_θ.

Unlike some other proposed methods for representation learning in reinforcement learning (Srinivas et al., 2020), SPR can be used with or without data augmentation, including in contexts where data augmentation is unavailable or counterproductive. Moreover, compared to related work on contrastive representation learning, SPR does not use negative samples, which may require careful design of contrastive tasks, large batch sizes (Chen et al., 2020a), or the use of a buffer to emulate large batch sizes (He et al., 2020).
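The two core ingredients of the objective, the EMA target update and the negative-cosine-similarity prediction loss, can be sketched as follows. This is a hedged NumPy sketch with illustrative names (`ema_update`, `spr_loss`); the actual implementation applies these updates to PyTorch network parameters and minibatches.

```python
import numpy as np

def ema_update(theta_m, theta_o, tau):
    """Target update: theta_m <- tau * theta_m + (1 - tau) * theta_o, no gradients."""
    return {k: tau * theta_m[k] + (1.0 - tau) * theta_o[k] for k in theta_m}

def spr_loss(y_hat, y_tilde):
    """Negative cosine similarity, summed over the K prediction steps.

    y_hat:   (K, D) predictions q(g_o(z_hat_{t+k}))
    y_tilde: (K, D) targets g_m(z_tilde_{t+k})
    """
    y_hat = y_hat / np.linalg.norm(y_hat, axis=-1, keepdims=True)
    y_tilde = y_tilde / np.linalg.norm(y_tilde, axis=-1, keepdims=True)
    return float(-np.sum(y_hat * y_tilde))
```

With τ = 0 the target parameters simply track the online parameters, matching the special case discussed above; perfect predictions give a loss of −K.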

2.3. TRANSITION MODEL ARCHITECTURE

For the transition model h, we apply a convolutional network directly to the 64 × 7 × 7 spatial output of the convolutional encoder f o . The network comprises two 64-channel convolutional layers with 3 × 3 filters, with batch normalization (Ioffe & Szegedy, 2015) after the first convolution and ReLU nonlinearities after each convolution. We append a one-hot vector representing the action taken to each location in the input to the first convolutional layer, similar to Schrittwieser et al. (2020) . We use a maximum prediction depth of K = 5, and we truncate calculation of the SPR loss at episode boundaries to avoid encoding environment reset dynamics into the model.
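The action conditioning described above, appending a one-hot action encoding at each spatial location of the latent map, can be sketched as below. This is an illustrative NumPy sketch (`append_action_plane` is a hypothetical helper, not the authors' code); the real model performs the equivalent channel concatenation on PyTorch tensors before the first convolution of h.

```python
import numpy as np

def append_action_plane(latent, action, num_actions):
    """Append a one-hot action encoding at every spatial location of a (C, H, W) latent.

    For SPR's encoder output this takes a (64, 7, 7) map to (64 + num_actions, 7, 7),
    which is then fed to the first convolution of the transition model.
    """
    c, h, w = latent.shape
    planes = np.zeros((num_actions, h, w), dtype=latent.dtype)
    planes[action] = 1.0  # one-hot across the action channels, broadcast over H x W
    return np.concatenate([latent, planes], axis=0)
```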

Algorithm 1: Self-Predictive Representations

Denote the parameters of the online encoder f_o and projection g_o as θ_o
Denote the parameters of the target encoder f_m and projection g_m as θ_m
Denote the parameters of the transition model h, predictor q and Q-learning head as φ
Denote the maximum prediction depth as K and the batch size as N
initialize replay buffer B
while training do
    collect experience (s, a, r, s') with (θ_o, φ) and add it to buffer B
    sample a minibatch of sequences (s, a, r, s') ~ B
    for i in range(0, N) do
        if augmentation then s^i ← augment(s^i); s'^i ← augment(s'^i) end
        ẑ^i_0 ← f_o(s^i_0)                                           // online representations
        l^i ← 0
        for k in (1, ..., K) do
            ẑ^i_k ← h(ẑ^i_{k-1}, a^i_{k-1})                          // latent states via transition model
            z̃^i_k ← f_m(s^i_k)                                       // target representations
            ŷ^i_k ← q(g_o(ẑ^i_k));  ỹ^i_k ← g_m(z̃^i_k)               // projections
            l^i ← l^i − (ỹ^i_k / ||ỹ^i_k||_2) · (ŷ^i_k / ||ŷ^i_k||_2) // SPR loss at step k
        end
        l^i ← λ l^i + RL_loss(s^i, a^i, r^i, s'^i; θ_o)              // add the RL loss for the batch
    end
    l ← (1/N) Σ_i l^i                                                // average loss over the minibatch
    θ_o, φ ← optimize((θ_o, φ), l)                                   // update online parameters
    θ_m ← τ θ_m + (1 − τ) θ_o                                        // update target parameters via EMA
end

2.4. DATA AUGMENTATION

When using augmentation, we use the same set of image augmentations as in DrQ (Yarats et al., 2021), consisting of small random shifts and color jitter. We normalize activations to lie in [0, 1] at the output of the convolutional encoder and transition model, as in Schrittwieser et al. (2020). We use Kornia (Riba et al., 2020) for efficient GPU-based data augmentation. When not using augmentation, we find that SPR performs better when dropout (Srivastava et al., 2014) with probability 0.5 is applied at each layer of the online and target encoders. This is consistent with Laine & Aila (2017) and Tarvainen & Valpola (2017), who find that adding noise inside the network, as proposed by Bachman et al. (2014), is important when image-specific augmentation is not used.
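The random-shift portion of this augmentation can be sketched as pad-and-crop. The paper uses Kornia on the GPU; the following is a minimal NumPy stand-in with a hypothetical helper name (`random_shift`), assuming DrQ-style replication padding of 4 pixels.

```python
import numpy as np

def random_shift(img, pad=4, rng=None):
    """Random-shift augmentation in the style of DrQ: replication-pad, then crop back.

    img: (C, H, W) array; the output has the same shape, shifted by up to `pad`
    pixels in each spatial direction.
    """
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = img.shape
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    top = int(rng.integers(0, 2 * pad + 1))   # vertical offset of the crop
    left = int(rng.integers(0, 2 * pad + 1))  # horizontal offset of the crop
    return padded[:, top:top + h, left:left + w]
```

Each observation in a batch would be augmented independently, as described in Section 2.2.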

2.5. IMPLEMENTATION DETAILS

For our Atari experiments, we largely follow van Hasselt et al. (2019) for DQN hyperparameters, with four exceptions. We follow DrQ (Yarats et al., 2021) by: using the 3-layer convolutional encoder from Mnih et al. (2015), using 10-step returns instead of 20-step returns for Q-learning, and not using a separate DQN target network when using augmentation. We also perform two gradient steps per environment step instead of one. We show results for this configuration with and without augmentation in Table 5, and confirm that these changes are not themselves responsible for our performance.

We reuse the first layer of the DQN MLP head as the SPR projection head g_o. When using dueling DQN (Wang et al., 2016), g_o concatenates the outputs of the first layers of the value and advantage heads. When these layers are noisy (Fortunato et al., 2018), g_o does not use the noisy parameters. Finally, we parameterize the predictor q as a linear layer. We use τ = 0.99 when augmentation is disabled and τ = 0 when it is enabled. For L^total_θ = L^RL_θ + λ L^SPR_θ, we use λ = 2. Hyperparameters were tuned over a subset of games (following Mnih et al., 2015; Machado et al., 2018); we list the complete hyperparameters in Table 3. Our implementation uses rlpyt (Stooke & Abbeel, 2019) and PyTorch (Paszke et al., 2019). We find that SPR modestly increases the time required for training, which we discuss in more detail in Appendix D.

3. RELATED WORK

3.1. DATA-EFFICIENT RL

A number of works have sought to improve sample efficiency in deep RL. SiMPLe (Kaiser et al., 2019) learns a pixel-level transition model for Atari to generate simulated training data, achieving strong results on several games in the 100k-frame setting, at the cost of requiring several weeks for training. However, van Hasselt et al. (2019) and Kielak (2020) introduce variants of Rainbow (Hessel et al., 2018) tuned for sample efficiency, Data-Efficient Rainbow (DER) and OTRainbow, which achieve comparable or superior performance with far less computation. Data augmentation has also proven effective (e.g., Laskin et al., 2020) in multi-task and transfer settings. We show that data augmentation can be more effectively leveraged in reinforcement learning by forcing representations to be consistent between different augmented views of an observation while also predicting future latent states.

3.2. REPRESENTATION LEARNING IN RL

Representation learning has a long history of use in RL; see Lesort et al. (2018) for a survey. For example, CURL (Srinivas et al., 2020) proposed a combination of image augmentation and a contrastive loss to perform representation learning for RL. However, follow-up results from RAD (Laskin et al., 2020) suggest that most of the benefits of CURL come from image augmentation rather than its contrastive loss. CPC (Oord et al., 2018), CPC|Action (Guo et al., 2018), ST-DIM (Anand et al., 2019) and DRIML (Mazoure et al., 2020) propose to optimize various temporal contrastive losses in reinforcement learning environments. We perform an ablation comparing such temporal contrastive losses to our method in Section 5. Kipf et al. (2019) propose to learn object-oriented contrastive representations by training a structured transition model based on a graph neural network. SPR bears some resemblance to DeepMDP (Gelada et al., 2019), which trains a transition model with an unnormalized L2 loss to predict representations of future states, along with a reward prediction objective. However, DeepMDP uses its online encoder to generate prediction targets rather than employing a target encoder, and is thus prone to representational collapse (see Section C.5 of Gelada et al., 2019). To mitigate this issue, DeepMDP relies on an additional observation reconstruction objective. In contrast, our model is self-supervised, trained entirely in the latent space, and uses a normalized loss. Our ablations (Section 5) demonstrate that using a target encoder has a large impact on our method. SPR is also similar to PBL (Guo et al., 2020), which directly predicts representations of future states. However, PBL uses two separate target networks trained via gradient descent, whereas SPR uses a single target encoder, updated without backpropagation.
Moreover, PBL studies multitask generalization in the asymptotic limits of data, whereas SPR is concerned with single-task performance in low data regimes, using 0.01% as much data as PBL. Unlike PBL, SPR additionally enforces consistency across augmentations, which empirically provides a large boost in performance.

4. RESULTS

We test SPR on the sample-efficient Atari setting introduced by Kaiser et al. (2019) and van Hasselt et al. (2019). In this task, only 100,000 environment steps of training data are available - equivalent to 400,000 frames, or just under two hours - compared to the typical standard of 50,000,000 environment steps, or roughly 39 days of experience. When used without data augmentation, SPR achieves scores comparable to the previous best result from Yarats et al. (2021). When combined with data augmentation, SPR achieves a median human-normalized score of 0.415, a new state-of-the-art result on this task. SPR achieves super-human performance on seven games in this data-limited setting: Boxing, Krull, Kangaroo, Road Runner, James Bond and Crazy Climber, compared to a maximum of two for any previous method, and achieves scores higher than DrQ (the previous state-of-the-art method) on 23 out of 26 games. See Table 1 for aggregate metrics and Figure 3 for a visualization of results. A full list of scores is presented in Table 4 in the appendix. For consistency with previous works, we report human and random scores from Wang et al. (2016).

Table 1: Performance of different methods on the 26 Atari games considered by Kaiser et al. (2019) after 100k environment steps. Results are recorded at the end of training and averaged over 10 random seeds for SPR, 20 for CURL, and 5 for other methods. SPR outperforms prior methods on all aggregate metrics, and exceeds expert human performance on 7 out of 26 games.

4.1. EVALUATION

We evaluate the performance of different methods by computing the average episodic return at the end of training. As done in previous works, we normalize scores with respect to expert human scores to account for the different score scales in each game. The human-normalized score of an agent on a game is calculated as

(agent score − random score) / (human score − random score)

and aggregated across the 26 games by mean or median. We find that human scores on some games are so high that differences between methods are washed out by normalization, making it hard for these games to influence aggregate metrics. Moreover, we find that the median score is typically influenced by only a handful of games. These factors compound to make the median human-normalized score an unreliable metric for judging overall performance. To address this, we also report DQN-normalized scores, defined analogously to human-normalized scores but calculated using scores from DQN agents (Mnih et al., 2015) trained over 50 million steps. We report both mean and median of these metrics in all results and ablations, and plot the distribution of scores over all the games in Figure 3. Following observations (Henderson et al., 2018) that comparisons based on small numbers of random seeds are unreliable, we average our results over ten random seeds, twice as many as most previous works.
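The per-game normalization and aggregation above amount to a one-line computation. A minimal NumPy sketch with made-up scores for two hypothetical games (the helper name `human_normalized` is illustrative, not from the paper's code):

```python
import numpy as np

def human_normalized(agent, random, human):
    """(agent score - random score) / (human score - random score), per game."""
    return (agent - random) / (human - random)

# Two hypothetical games with made-up raw scores.
scores = human_normalized(np.array([120.0, 30.0]),   # agent
                          np.array([20.0, 10.0]),    # random policy
                          np.array([220.0, 110.0]))  # expert human
aggregate = {"mean": float(scores.mean()), "median": float(np.median(scores))}
```

The same formula yields DQN-normalized scores when the human baseline is replaced by a fully-trained DQN's score.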

5. ANALYSIS

The target encoder: We find that using a separate target encoder is vital in all cases. A variant of SPR in which target representations are generated by the online encoder without a stop-gradient (as done by, e.g., Gelada et al., 2019) exhibits catastrophically reduced performance, with a median human-normalized score of 0.278 with augmentation versus 0.415 for SPR. However, there is more flexibility in the EMA constant used for the target encoder. When using augmentation, a value of τ = 0 performs best, while without augmentation we use τ = 0.99. The success of τ = 0 is interesting, since the related method BYOL reports very poor representation learning performance in this case. We hypothesize that optimizing a reinforcement learning objective in parallel with the SPR loss explains this difference, as it provides an additional gradient which discourages representational collapse. Full results for these experiments are presented in Appendix C.

Dynamics modeling is key: A key distinction between SPR and other recent approaches leveraging representation learning for reinforcement learning, such as CURL (Srinivas et al., 2020) and DRIML (Mazoure et al., 2020), is our use of an explicit multi-step dynamics model. To illustrate the impact of dynamics modeling, we test SPR with a variety of prediction depths K. Two of these ablations, one with no dynamics modeling and one that models only a single step of dynamics, are presented in Table 2 (as Non-temporal SPR and 1-step SPR), and all are visualized in Figure 4. We find that extended dynamics modeling consistently improves performance up to roughly K = 5. Moving beyond this continues to improve performance on a subset of games, at the cost of increased computation. Note that the non-temporal ablation we test is similar to using BYOL (Grill et al., 2020) as an auxiliary task, with particular architecture choices made for the projection layer and predictor.
Comparison with contrastive losses: Though many recent works in representation learning employ contrastive learning, we find that SPR consistently outperforms both temporal and non-temporal variants of contrastive losses (see Table 6, appendix), including CURL (Srinivas et al., 2020).

Using a quadratic loss causes collapse: SPR's use of a cosine similarity objective (i.e., a normalized L2 loss) sets it apart from some previous works, such as DeepMDP (Gelada et al., 2019), which have learned latent dynamics models by minimizing an unnormalized L2 loss over predictions of future latents. To examine the importance of this objective, we test a variant of SPR that minimizes an unnormalized L2 loss (Quadratic SPR in Table 2), and find that it performs only slightly better than random. This is consistent with results from Gelada et al. (2019), who find that DeepMDP's representations are prone to collapse, and use an auxiliary reconstruction objective to prevent this.

Projections are critical: Another distinguishing feature of SPR is its use of projection and prediction networks. We test a variant of SPR that uses neither, instead computing the SPR loss directly over the 64 × 7 × 7 convolutional feature map used by the transition model (SPR without projections in Table 2). We find that this variant has inferior performance, and suggest two possible explanations. First, the convolutional network represents only a small fraction of the capacity of SPR's network, containing only some 80,000 parameters out of a total of three to four million. Employing the first layer of the DQN head as a projection thus allows the SPR objective to affect far more of the network, while in this variant its impact is limited. Second, the effects of SPR in forcing invariance to augmentation may be undesirable at this level; as the convolutional feature map is the product of only three layers, it may be challenging to learn features that are simultaneously rich and invariant.
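The equivalence between the cosine objective and a normalized L2 loss can be checked numerically: on unit-normalized vectors, the squared L2 distance is an affine function of cosine similarity, so the two objectives have the same minimizers. A quick illustrative check (not the authors' code):

```python
import numpy as np

a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)

cos_sim = float(a_n @ b_n)                 # SPR's similarity term
norm_l2 = float(np.sum((a_n - b_n) ** 2))  # "normalized L2" loss on unit vectors

# Identity: ||a_n - b_n||^2 = 2 - 2 * cos_sim, so maximizing cosine similarity
# is equivalent to minimizing the normalized L2 loss.
```

An unnormalized L2 loss has no such invariance: it can be driven to zero by shrinking all representations toward the origin, which is one intuition for the collapse observed with Quadratic SPR.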

6. FUTURE WORK

Recent work in both visual (Chen et al., 2020b) and language representation learning (Brown et al., 2020) has suggested that self-supervised models trained on large datasets perform exceedingly well on downstream problems with limited data, often outperforming methods trained using only task-specific data. Future work could similarly exploit large corpora of unlabelled data, perhaps from multiple MDPs or raw videos, to further improve the performance of RL methods in low-data regimes. As the SPR objective is unsupervised, it could be directly applied in such settings. Another interesting direction is to use the transition model learned by SPR for planning. MuZero (Schrittwieser et al., 2020) has demonstrated that planning with a model supervised via reward and value prediction can work extremely well given sufficiently massive amounts of data. It remains unclear whether such models can work well in low-data regimes, and whether augmenting them with self-supervised objectives such as SPR can improve their data efficiency. It would also be interesting to examine whether self-supervised methods like SPR can improve generalization to unseen tasks or changes in environment, similar to how unsupervised pretraining on ImageNet can generalize to other datasets (He et al., 2020; Grill et al., 2020).

7. CONCLUSION

In this paper we introduced Self-Predictive Representations (SPR), a self-supervised representation learning algorithm designed to improve the data efficiency of deep reinforcement learning agents. SPR learns representations that are both temporally predictive and consistent across different views of environment observations, by directly predicting representations of future states produced by a target encoder. SPR achieves state-of-the-art performance on the 100k steps Atari benchmark, demonstrating significant improvements over prior work. Our experiments show that SPR is highly robust, and is able to outperform the previous state of the art when either data augmentation or temporal prediction is disabled. We identify important directions for future work, and hope continued research at the intersection of self-supervised learning and reinforcement learning leads to algorithms which rival the efficiency and robustness of humans. 

A ATARI DETAILS

We provide a full set of hyperparameters used in both the augmentation and no-augmentation cases in Table 3, including new hyperparameters for SPR. To ensure that the minor hyperparameter changes we make to the DER baseline are not solely responsible for our improved performance, we perform controlled experiments using the same hyperparameters and same random seeds for baselines. We find that our controlled Rainbow implementation without augmentation is slightly stronger than Data-Efficient Rainbow but comparable to Overtrained Rainbow (Kielak, 2020), while with augmentation enabled our results are somewhat stronger than DrQ.² None of these methods, however, comes close to the performance of SPR.

To illustrate the influence of the EMA constant τ, we evaluate τ at 9 values logarithmically interpolated³ between 0.999 and 0 on a subset of 10 Atari games.⁴ We use 10 seeds per game, and evaluate SPR both with and without augmentation; parameters other than τ are identical to those listed in Table 3. To equalize the importance of games in this analysis, we normalize by the average score across all tested values of τ for each game, calculating a self-normalized score as

(agent score − random score) / (average score − random score).

We calculate the self-normalized score separately for the augmentation and no-augmentation cases. Results are shown in Figure 5. With augmentation, we observe a clear peak in performance at τ = 0, equivalent to a target encoder with no EMA-based smoothing. Without augmentation, however, the story is less clear, and the method appears less sensitive to τ (note the y-axis scales). We use τ = 0.99 in this case, based on its reasonable performance and consistency with prior work (e.g., Grill et al., 2020). Overall, however, we note that SPR does not appear overly sensitive to τ, unlike purely unsupervised methods such as BYOL; in no case does SPR fail to train.
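The nine τ values in this sweep (listed in the footnotes) are consistent with interpolating 1 − τ logarithmically from 10⁻³ up to 1. A quick illustrative check, not taken from the authors' code:

```python
import numpy as np

# Sweep 1 - tau over 9 logarithmically spaced points in [1e-3, 1]; the resulting
# tau values run from 0.999 down to 0, matching the footnoted list after rounding.
taus = 1.0 - np.logspace(-3, 0, num=9)
rounded = [float(t) for t in np.round(taus, 4)]
```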
We hypothesize that the difference between the augmentation and no-augmentation cases is partially due to augmentation rendering the stabilizing effect of using an EMA target network (e.g., as observed by Grill et al., 2020; Tarvainen & Valpola, 2017) redundant. Prior work has already noted that using an EMA target network can slow down learning early in training (Tarvainen & Valpola, 2017) ; in our context, where a limited number of environment samples are taken in parallel with optimization, this may "waste" environment samples by collecting them with an inferior policy. To resolve this, Tarvainen & Valpola (2017) proposed to increase τ over the course of training, slowing down changes to the target network later in training. It is possible that doing so here could allow SPR to achieve the best of both worlds, but it would require tuning an additional hyperparameter, the schedule by which τ is increased, and we thus leave this topic for future work.



Footnotes:
¹ Cosine similarity is linearly related to the "normalized L2" loss used in BYOL (Grill et al., 2020).
² This is perhaps not surprising, given that the model used by DrQ omits many of the components of Rainbow.
³ τ ∈ {0.999, 0.9976, 0.9944, 0.9867, 0.9684, 0.925, 0.8222, 0.5783, 0}.
⁴ Pong, Breakout, Up N Down, Kangaroo, Bank Heist, Assault, Boxing, BattleZone, Frostbite and Crazy Climber.



Figure 1: Median and mean human-normalized scores of different methods across the 26 games in the Atari 100k benchmark (Kaiser et al., 2019), averaged over 10 random seeds for SPR, and 5 seeds for most other methods except CURL, which uses 20. Each method is allowed access to only 100k environment steps or 400k frames per game. (*) indicates that the method uses data augmentation. SPR achieves state-of-the-art results on both mean and median human-normalized scores. Note that, even without data augmentation, SPR still outperforms all prior methods on both metrics.

Figure 2: An illustration of the full SPR method. Representations from the online encoder are used in the reinforcement learning task and for prediction of future representations from the target encoder via the transition model. The target encoder and projection head are defined as an exponential moving average of their online counterparts and are not updated via gradient descent. For brevity, we illustrate only the k-th step of future prediction, but in practice we compute the loss over all steps from 1 to K. Note: our implementation for this paper includes g_o in the Q-learning head.

Figure 3: A boxplot of the distribution of human-normalized scores across the 26 Atari games under consideration, after 100k environment steps. The whiskers represent the interquartile range of human-normalized scores over the 26 games. Scores for each game are recorded at the end of training and averaged over 10 random seeds for SPR, 20 for CURL, and 5 for other methods.

Figure 4: Performance of SPR with various prediction depths. Results are averaged across ten seeds per game, for all 26 games. To equalize the importance of games, we calculate an SPR-normalized score analogously to human-normalized scores, and show its mean and median across all 26 games. All other hyperparameters are identical to those used for SPR with augmentation.

Figure 5: Performance on a subset of 10 Atari games for different values of the EMA parameter τ, with augmentation (left) and without (right). Scores are averaged across 10 seeds per game for each value of τ. The self-normalized score is calculated separately for the augmentation and no-augmentation cases.

3.1 DATA-EFFICIENT RL

A number of works have sought to improve sample efficiency in deep RL. SimPLe (Kaiser et al., 2019) learns a pixel-level transition model for Atari to generate simulated training data, achieving strong results on several games in the 100k frame setting, at the cost of requiring several weeks for training. In contrast, van Hasselt et al. (2019) and Kielak (2020) introduce variants of Rainbow (Hessel et al., 2018) tuned for sample efficiency, Data-Efficient Rainbow (DER) and OTRainbow, which achieve comparable or superior performance with far less computation.

We evaluate SPR on the sample-efficient Atari setting introduced by Kaiser et al. (2019) and van Hasselt et al. (2019). In this task, only 100,000 environment steps of training data are available (equivalent to 400,000 frames, or just under two hours), compared to the typical standard of 50,000,000 environment steps, or roughly 39 days of experience. When used without data augmentation, SPR demonstrates scores comparable to the previous best result from Yarats et al. (2021). When combined with data augmentation, SPR achieves a median human-normalized score of 0.415, which is a new state-of-the-art result on this task. SPR achieves super-human performance on seven games in this data-limited setting, including Boxing, Krull, Kangaroo, Road Runner, James Bond and Crazy Climber, compared to a maximum of two for any previous method, and achieves scores higher than DrQ (the previous state-of-the-art method) on 23 out of 26 games. See Table 1 for aggregate metrics and Figure 3 for a visualization of results. A full list of scores is presented in Table 4 in the appendix. For consistency with previous works, we report human and random scores from Wang et al. (2016).

Table 1: Performance of different methods on the 26 Atari games considered by Kaiser et al. (2019) after 100k environment steps. Results are recorded at the end of training and averaged over 10 random seeds for SPR, 20 for CURL, and 5 for other methods. SPR outperforms prior methods on all aggregate metrics, and exceeds expert human performance on 7 out of 26 games.

Scores on the 26 Atari games under consideration for ablated variants of SPR. All variants listed here use data augmentation.

Additionally, we note that the standard evaluation protocol of evaluating over only 500,000 frames per game is problematic, as the quantity we are trying to measure is expected return over episodes. Because episodes may last up to 108,000 frames, this protocol may collect as few as four complete episodes. As variance of results is already a concern in deep RL (see Henderson et al., 2018), we recommend evaluating over 100 episodes irrespective of their length. Moreover, to address findings from Henderson et al. (
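The recommended protocol, averaging over a fixed number of complete episodes rather than a fixed frame budget, can be sketched as follows. The `env` interface here is a simplified, hypothetical one (`step` returning observation, reward, and done), not a specific library's API:

```python
def evaluate(policy, env, n_episodes=100):
    """Estimate expected return over n_episodes complete episodes, rather
    than over a fixed frame budget that may truncate the sample."""
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            # Simplified step signature: (observation, reward, done).
            obs, reward, done = env.step(policy(obs))
            ep_return += reward
        returns.append(ep_return)
    return sum(returns) / n_episodes
```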

Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GY6-6sTvGaf. Cited on pages 5, 6, and 7.

Hyperparameters for SPR on Atari, with and without augmentation.

We provide full results across all 26 games for the methods considered, including SPR with and without augmentation, in Table 4. Methods are listed in rough order of their date of release or publication.

Mean episodic returns on the 26 Atari games considered by Kaiser et al. (2019) after 100k environment steps. The results are recorded at the end of training and averaged over 10 random seeds. SPR outperforms prior methods on all aggregate metrics, and exceeds expert human performance on 7 out of 26 games while using a similar amount of experience.

Scores on the 26 Atari games under consideration for our controlled Rainbow implementation with and without augmentation, compared to previous methods. The high mean DQN-normalized score of our DQN without augmentation is due to an atypically high score on Private Eye, a hard exploration game on which the original DQN achieves a low score.

Scores on the 26 Atari games under consideration for variants of SPR with different target encoder schemes, without augmentation.

ACKNOWLEDGEMENTS

We are grateful for the collaborative research environment provided by Mila and Microsoft Research. We would also like to acknowledge Hitachi for providing funding support for this project. We thank Nitarshan Rajkumar and Evan Racah for providing feedback on an earlier draft; Denis Yarats and Aravind Srinivas for answering questions about DrQ and CURL; Michal Valko, Sherjil Ozair and the BYOL team for long discussions about BYOL; and Phong Nguyen for helpful discussions. Finally, we thank Compute Canada and Microsoft Research for providing computational resources used in this project.

APPENDIX

Published as a conference paper at ICLR 2021

• A contrastive loss based solely on different views of the same state, similar to CURL (Srinivas et al., 2020).
• A temporal contrastive loss with both augmentation and targets drawn one step in the future, equivalent to single-step CPC (Oord et al., 2018).
• A temporal contrastive loss with an explicit dynamics model, similar to CPC|Action (Guo et al., 2018). Predictions are made up to five steps in the future, and encodings of every state except s_{t+k} are used as negative samples for s_{t+k}.
• A soft contrastive approach inspired by Wang & Isola (2020), who propose to decouple the repulsive and attractive effects of contrastive learning into two separate losses: one similar to the SPR objective, which encourages representations to be invariant to augmentation or noise, and one which encourages representations to be uniformly distributed on the unit hypersphere. We optimize this uniformity objective jointly with the SPR loss, which takes the role of the "invariance" objective proposed by Wang & Isola (2020). We use t = 2 in the uniformity loss, and give it a weight equal to that given to the SPR loss, based on hyperparameters used by Wang & Isola (2020).

To create as fair a comparison as possible, we use the same augmentation (random shifts and intensity) and the same Rainbow hyperparameters as in SPR with augmentation. As in SPR, we calculate contrastive losses using the output of the first layer of the Q-head MLP, with a bilinear classifier (as in Oord et al., 2018). Following Chen et al. (2020a), we use annealed cosine similarities with a temperature of 0.1 in the contrastive loss. We present results in Table 6. Although all of these variants outperform the previous contrastive result on this task, CURL, none of them substantially improve performance over the controlled Rainbow they use as a baseline.
We consider these results broadly consistent with those of CURL, which observes a relatively small performance boost over their baseline, Data-Efficient Rainbow (van Hasselt et al., 2019) .
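As a reference point for the contrastive variants above, a minimal InfoNCE loss with a bilinear classifier and temperature 0.1 might look as follows. This is a sketch over random stand-in representations, not our implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 32, 16                      # batch size, representation size
temperature = 0.1

anchors = rng.normal(size=(N, D))  # e.g. representations of augmented s_t
targets = rng.normal(size=(N, D))  # e.g. target representations of s_{t+1}
W = rng.normal(size=(D, D))        # bilinear classifier, as in Oord et al. (2018)

# Normalize, then score every (anchor, target) pair; the matching index is
# the positive, every other target in the batch serves as a negative.
a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
logits = (a @ W @ t.T) / temperature
logits -= logits.max(axis=1, keepdims=True)   # numerical stability

# InfoNCE = cross-entropy with the positive pair on the diagonal.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
```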

C THE ROLE OF THE TARGET ENCODER IN SPR

We consider several variants of SPR with the target network modified, and present aggregate metrics for these experiments in Table 7. We first evaluate a variant of SPR in which target representations are drawn from the online encoder and gradients are allowed to propagate into the online encoder through them, effectively allowing the encoder to learn to make its representations more predictable. We find that this leads to drastic reductions in performance both with and without augmentation, which we attribute to representational collapse.
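The EMA target update that these variants modify can be sketched as follows; `ema_update` is a hypothetical helper for illustration, not code from the paper:

```python
import numpy as np

def ema_update(target_params, online_params, tau):
    """theta_target <- tau * theta_target + (1 - tau) * theta_online."""
    return {k: tau * target_params[k] + (1 - tau) * online_params[k]
            for k in target_params}

online = {"w": np.ones(4)}
target = {"w": np.zeros(4)}
target = ema_update(target, online, tau=0.99)  # target drifts slowly toward online

# tau = 0 copies the online network at every step; tau = 1 freezes the target.
```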

D WALL CLOCK TIMES

We report wall-clock runtimes for a selection of methods in Table 8. SPR with augmentation takes around four and a half hours to finish a complete training and evaluation run for 100k steps on a single Atari game. We find that using data augmentation adds an overhead, and SPR without augmentation can run in just three hours. SPR's wall-clock run-time compares very favorably to previous works such as SimPLe (Kaiser et al., 2019), which requires roughly three weeks to train on a GPU comparable to those used for SPR.

