A LARGE SCALE SAMPLE COMPLEXITY ANALYSIS OF NEURAL POLICIES IN THE LOW-DATA REGIME

Anonymous authors
Paper under double-blind review

Abstract

Reinforcement learning algorithm development has accelerated enormously since the initial study that enabled sequential decision making from high-dimensional observations. Deep reinforcement learning research has since produced breakthroughs ranging from learning without the presence of rewards to learning functioning policies without even knowing the rules of the game. In our paper we focus on the trends currently followed in deep reinforcement learning algorithm development in the low-data regime. We theoretically show that the performance profiles of algorithms developed for the high-data regime do not transfer to the low-data regime in the same order. We conduct extensive experiments in the Arcade Learning Environment, and our results demonstrate that the baseline algorithms perform significantly better in the low-data regime than the set of algorithms that were initially designed and compared in the large-data regime.

1. INTRODUCTION

Reinforcement learning research accelerated rapidly after the initial study on approximating the state-action value function via deep neural networks (Mnih et al., 2015). Following this initial study, several highly successful deep reinforcement learning algorithms have been proposed (Hasselt et al., 2016b; Wang et al., 2016; Hessel et al., 2018; 2021), from targeting different architectural ideas to employing estimators targeting overestimation, all of which were designed and tested in the high-data regime (i.e. two hundred million frame training). An alternative recent line of research with an extensive amount of publications focused on pushing the performance bounds of deep reinforcement learning policies in the low-data regime (Yarats et al., 2021; Ye et al., 2021; Kaiser et al., 2020; van Hasselt et al., 2019; Kielak, 2019) (i.e. with one hundred thousand environment interaction training). Several unique ideas in current reinforcement learning research, from model-based reinforcement learning to increasing sample efficiency with observation regularization, gained acceleration based on policy performance comparisons demonstrated in the Arcade Learning Environment 100K benchmark. However, we demonstrate that there is a significant overlooked assumption being made in this line of research without being explicitly discussed. This implicit assumption, which is commonly shared amongst a large collection of low-data regime studies, carries significant importance because these studies shape future research directions with incorrect reasoning, influencing the overall research effort put into particular research ideas for several years following.
Thus, in our paper we target this implicit assumption and aim to answer the following questions:

• How can we theoretically explain the relationship between asymptotic sample complexity and low-data regime sample complexity in deep reinforcement learning?
• How do the performance profiles of deep reinforcement learning algorithms designed for the high-data regime transform in the low-data regime?
• Can we expect the performance rank of algorithms to hold under variations in the number of samples used in policy training?

Hence, to answer the questions raised above, in our paper we focus on sample complexity in deep reinforcement learning and make the following contributions:

• We provide a theoretical foundation for the non-transferability of the performance profiles of deep reinforcement learning algorithms designed for the high-data regime to the low-data regime.
• We theoretically demonstrate that the performance profile has a non-monotonic relationship between the asymptotic sample complexity region and the low-data sample complexity region. Furthermore, we prove that the sample complexity of distributional reinforcement learning is higher than the sample complexity of the baseline deep Q-network algorithms.
• We conduct large-scale extensive experiments for a variety of deep reinforcement learning baseline algorithms in both the low-data regime and the high-data regime of the Arcade Learning Environment benchmark.
• We highlight that recent algorithms proposed and evaluated in the Arcade Learning Environment 100K benchmark are significantly affected by the implicit assumption on the relationship between performance profiles and sample complexity.

2.1. DEEP REINFORCEMENT LEARNING

The reinforcement learning problem is formalized as a Markov Decision Process (MDP) represented as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \rho_0 \rangle$ where $\mathcal{S}$ represents the state space, $\mathcal{A}$ represents the set of actions, $\mathcal{P}$ represents the transition probability distribution on $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ represents the reward function, and $\gamma \in (0, 1]$ represents the discount factor. The aim in reinforcement learning is to learn an optimal policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ that maps state observations to actions and maximizes the expected cumulative discounted rewards

$$R = \mathbb{E}_{a_t \sim \pi(s_t, \cdot)} \sum_t \gamma^t \mathcal{R}(s_t, a_t, s_{t+1}).$$

This objective is achieved by constructing a state-action value function that learns, for each state-action pair, the expected cumulative discounted rewards that will be obtained if action $a \in \mathcal{A}$ is executed in state $s \in \mathcal{S}$:

$$Q(s_t, a_t) = \mathcal{R}(s_t, a_t, s_{t+1}) + \gamma \sum_{s_{t+1}} \mathcal{P}(s_{t+1} \mid s_t, a_t) V(s_{t+1}).$$

In settings where the state space and/or action space is large enough that the state-action value function $Q(s, a)$ cannot be held in tabular form, a function approximator is used. Thus, in deep reinforcement learning the Q-function is approximated via deep neural networks, with parameters updated as

$$\theta_{t+1} = \theta_t + \alpha \big( \mathcal{R}(s_t, a_t, s_{t+1}) + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \theta_t) - Q(s_t, a_t; \theta_t) \big) \nabla_{\theta_t} Q(s_t, a_t; \theta_t).$$

Dueling Architecture: At the end of the convolutional layers of a given deep Q-network, the dueling architecture outputs two streams of fully connected layers estimating the state value $V(s)$ and the advantage $A(s, a) = Q(s, a) - \max_{a'} Q(s, a')$ for each action in a given state $s$. In particular, the last layer of the dueling architecture contains the forward mapping

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha) \Big)$$

where $\theta$ represents the parameters of the convolutional layers, and $\alpha$ and $\beta$ represent the parameters of the fully connected layers outputting the advantage and state value estimates respectively.
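The dueling forward mapping can be illustrated with a minimal numerical sketch. The function and values below are ours, written with NumPy for clarity; they are not the paper's implementation:

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the dueling streams via the max-aggregation
    Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.max()

# With V(s) = 2.0 the greedy action's Q-value equals V(s) exactly,
# which is what makes the value/advantage decomposition identifiable.
q = dueling_q_values(2.0, [0.5, -1.0, 1.5])  # -> [1.0, -0.5, 2.0]
```

Subtracting the maximum advantage pins the greedy action's Q-value to the state value, resolving the otherwise unidentifiable split between the two streams.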
Distributional Reinforcement Learning: The baseline distributional reinforcement learning algorithm C51 was proposed by Bellemare et al. (2017). The projected Bellman update for the $i$-th atom is computed as

$$(\Phi \hat{\mathcal{T}} Z_\theta(s_t, a_t))_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\big| [\hat{\mathcal{T}} z_j]_{v_{\min}}^{v_{\max}} - z_i \big|}{\Delta z} \right]_0^1 \tau_j \big( s_{t+1}, \arg\max_{a \in \mathcal{A}} \mathbb{E} Z_\theta(s_{t+1}, a) \big)$$

where $z_i = v_{\min} + i \Delta z$ for $0 \leq i < N$ represents the set of atoms in categorical learning, $\Delta z := \frac{v_{\max} - v_{\min}}{N - 1}$, and the atom probabilities are learnt as a parametric softmax model

$$\tau_i(s_t, a_t) = \frac{e^{\theta_i(s_t, a_t)}}{\sum_j e^{\theta_j(s_t, a_t)}}.$$

Following this baseline algorithm, Dabney et al. (2018b) proposed learning the quantiles of the state-action value distribution via the loss

$$L = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{K} \rho_\delta \Big( \mathcal{R}(s_t, a_t, s_{t+1}) + \gamma Z_{\delta_j} \big( s_{t+1}, \arg\max_{a \in \mathcal{A}} Q_\beta(s_{t+1}, a) \big) - Z_{\delta_i}(s_t, a_t) \Big)$$

where $\rho_\delta$ represents the Huber quantile regression loss, and $Q_\beta = \int_0^1 F_Z^{-1}(\delta) \, d\beta(\delta)$. Note that $Z_\delta = F_Z^{-1}(\delta)$ is the quantile function of the random variable $Z$ at $\delta \in [0, 1]$.
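The categorical projection step of the C51 update can be sketched as a standalone NumPy routine. The function name and the values below are ours, and the sketch simplifies the full update (it projects an arbitrary target distribution onto the fixed atom grid, ignoring the network and the softmax parameterization):

```python
import numpy as np

def project_categorical(target_z, target_p, v_min, v_max, num_atoms):
    """Project target atoms onto the fixed grid z_i = v_min + i * delta_z,
    splitting each target atom's mass linearly between its two nearest
    grid atoms after clipping to [v_min, v_max]."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    projected = np.zeros(num_atoms)
    for z_j, p_j in zip(target_z, target_p):
        b = (np.clip(z_j, v_min, v_max) - v_min) / delta_z  # continuous index
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                 # target landed exactly on a grid atom
            projected[lo] += p_j
        else:                        # split mass between neighbours
            projected[lo] += p_j * (hi - b)
            projected[hi] += p_j * (b - lo)
    return projected

# A single target atom at 2.5 splits evenly between grid atoms 2 and 3.
p = project_categorical([2.5], [1.0], v_min=0.0, v_max=10.0, num_atoms=11)
```

The linear split is exactly the $\big[1 - |[\hat{\mathcal{T}} z_j] - z_i| / \Delta z\big]_0^1$ weighting: each target atom contributes only to the two grid atoms within $\Delta z$ of it.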

3. LOW-DATA REGIME VERSUS ASYMPTOTIC PERFORMANCE

The high-level message of our empirical results is that comparing the asymptotic performance of two reinforcement learning algorithms does not necessarily give useful information on their relative performance in the low-data regime. In this section we provide mathematical motivation for this claim in the setting of finite-horizon MDPs with linear function approximation. In particular, a finite-horizon MDP is represented as a tuple as above, where we let $\pi(s)$ be the action taken by the policy $\pi$ in state $s$, and the corresponding value function is $V^\pi_t(s_t) = Q_t(s_t, \pi(s_t))$. The optimal non-stationary policy $\pi^*$ has value function $V^*_t(s_t) = V^{\pi^*}_t(s_t)$ satisfying

$$V^*_t(s_t) = \sup_\pi V^\pi_t(s_t).$$

The objective is to learn a sequence of non-stationary policies $\pi_k$ for $k \in \{1, \ldots, K\}$ while interacting with an unknown MDP in order to minimize the regret, which is measured over $K$ episodes of length $H$:

$$\mathrm{REGRET}(K) = \sum_{k=1}^{K} \big( V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) \big)$$

where $s^k_1 \in \mathcal{S}$ is the starting state of the $k$-th episode. In words, regret sums up the gap between the expected rewards obtained by the sequence of learned policies $\pi_k$ and those obtained by $\pi^*$ when learning for $K$ episodes. In the linear function approximation setting there is a feature map $\phi_t : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d_t}$ for each $t \in [H]$ that sends a state-action pair $(s, a)$ to the $d_t$-dimensional vector $\phi_t(s, a)$. Then, the state-action value function $Q_t(s_t, a_t)$ is parameterized by a vector $\theta_t \in \mathbb{R}^{d_t}$ so that $Q_t(\theta_t)(s_t, a_t) = \phi_t(s_t, a_t)^\top \theta_t$. Recent theoretical work in this setting (Zanette et al., 2020) gives an algorithm along with a lower bound that matches the regret achieved by the algorithm up to logarithmic factors.

Theorem 3.1 (Zanette et al. (2020)). Under appropriate normalization assumptions there is an algorithm that learns a sequence of policies $\pi_k$ achieving regret

$$\mathrm{REGRET}(K) = \tilde{O}\left( \sum_{t=1}^{H} d_t \sqrt{K} + \sum_{t=1}^{H} \sqrt{d_t} \, \mathcal{I} K \right),$$

where $\mathcal{I}$ is the intrinsic Bellman error.
Furthermore, this regret bound is optimal for this setting up to logarithmic factors in $d_t$, $K$ and $H$ whenever $K = \Omega\big( (\sum_{t=1}^H d_t)^2 \big)$, in the sense that for any level of intrinsic Bellman error $\mathcal{I}$ there exists a class of MDPs such that any algorithm incurs at least as much regret on at least one MDP in the class. Utilizing this theorem we can prove the following proposition on the relationship between performance in the asymptotic and low-data regimes.

Proposition 3.2. For every $1 > \alpha > \beta > 0$ there exist two thresholds $U > L > 1$, and a class of finite-horizon MDPs and feature maps $\phi_t$, each of dimension $d_t$, such that

1. Every algorithm receives regret at least $\mathrm{REGRET}(K) = \Omega(\alpha K)$ after $K \leq L$ episodes.
2. There exists an algorithm receiving regret $\mathrm{REGRET}(K) = \tilde{O}(\beta K)$ after $K \geq U$ episodes.

Proof. Let $\mathcal{I} = \frac{\beta}{\sum_{t=1}^H \sqrt{d_t}}$ and apply the lower bound of Theorem 3.1 with intrinsic Bellman error $\mathcal{I}$. Let $L = O\big( \frac{1}{(\alpha - \beta)^2} (\sum_{t=1}^H d_t)^2 \big)$. Then after $K$ episodes for $K \leq L$, the regret of any algorithm on the class of MDPs guaranteed by the theorem is at least

$$\mathrm{REGRET}(K) = \Omega\left( \sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \, \mathcal{I} K \right) = \Omega\left( \sum_{t=1}^H d_t \sqrt{K} + \beta K \right) = \Omega\left( (\alpha - \beta) \sqrt{K} \cdot \frac{1}{\alpha - \beta} \sum_{t=1}^H d_t + \beta K \right) \geq \Omega\big( (\alpha - \beta) K + \beta K \big) = \Omega(\alpha K)$$

where the second equality follows from the choice of $\mathcal{I}$, and the inequality from the fact that $K \leq L = O\big( \frac{1}{(\alpha - \beta)^2} (\sum_{t=1}^H d_t)^2 \big)$ implies $\frac{1}{\alpha - \beta} \sum_{t=1}^H d_t \geq \sqrt{K}$.

Next fix any constant $\epsilon > 0$ and let $U = \Omega\big( (\sum_{t=1}^H d_t)^{\frac{2}{1 - 2\epsilon}} \big)$. Then for $K \geq U$ episodes the algorithm guaranteed by Theorem 3.1 achieves regret

$$\mathrm{REGRET}(K) = \tilde{O}\left( \sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \, \mathcal{I} K \right) = \tilde{O}\left( \sum_{t=1}^H d_t \sqrt{K} + \beta K \right) \leq \tilde{O}\big( K^{1 - \epsilon} + \beta K \big) = \tilde{O}(\beta K)$$

where the inequality follows from the bound $K \geq U = \Omega\big( (\sum_{t=1}^H d_t)^{\frac{2}{1 - 2\epsilon}} \big)$.

Intuitively, Proposition 3.2 shows that in the linear function approximation setting, the gap between performance in the low-data regime ($K \leq L$ episodes) and the high-data/asymptotic regime ($K \geq U$ episodes) can be arbitrarily large.
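The crossover behaviour behind Proposition 3.2 can be visualized with a toy numerical sketch of the bound shapes from Theorem 3.1. All constants below are invented for illustration (the per-step dimension factors are collapsed into a single scalar $d$); this is not a simulation of any actual MDP:

```python
import numpy as np

def regret_bound(K, d, intrinsic_error):
    """Shape of the Theorem 3.1 bound: d * sqrt(K) + d * I * K,
    with the dimension factors collapsed into one scalar d."""
    return d * np.sqrt(K) + d * intrinsic_error * K

K = np.arange(1, 500_001, dtype=float)
small_d_biased = regret_bound(K, d=2.0, intrinsic_error=0.01)  # small d, I > 0
large_d_exact = regret_bound(K, d=10.0, intrinsic_error=0.0)   # large d, I = 0

# Low-data regime (K = 1000): the small-dimension algorithm looks better,
# because the sqrt(K) term dominates and its constant is smaller.
assert small_d_biased[999] < large_d_exact[999]
# High-data regime (K = 500000): the linear-in-K bias term flips the order.
assert small_d_biased[-1] > large_d_exact[-1]
```

The same mechanism drives the regret curves in Figure 3: a nonzero intrinsic Bellman error is invisible at small $K$ but dominates asymptotically.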
Thus, comparisons between algorithms in the asymptotic/high-data regime are not informative when trying to understand algorithm performance with limited data.

4. MEAN ESTIMATION VERSUS LEARNING THE DISTRIBUTION

To obtain theoretical insight into the larger sample complexity exhibited by distributional reinforcement learning, we consider the fundamental comparison between learning the distribution of a random variable $\mathcal{X}$ versus only learning the mean $\mathbb{E}[\mathcal{X}]$. In base distributional reinforcement learning the goal is to learn a distribution over state-action values that has finite support. Thus, to get a fundamental understanding of the additional cost of distributional reinforcement learning, we compare the sample complexity of learning the distribution of a finitely supported random variable with that of estimating its mean.

Proposition 4.1. Let $\mathcal{X}$ be a real-valued random variable with support on exactly $k$ known values. Further, assume $|\mathcal{X}| < 1$ and let $\epsilon > 0$. Any algorithm that learns the distribution $\mathbb{P}(\mathcal{X})$ to within total variation distance $\epsilon$ requires $\Omega(k/\epsilon^2)$ samples, while there exists an algorithm to estimate $\mathbb{E}[\mathcal{X}]$ to within error $\epsilon$ using only $O(1/\epsilon^2)$ samples.

Proof. See appendix for the full proof.

Although Proposition 4.1 proves that distributional reinforcement learning has an intrinsically higher sample complexity than standard Q-learning, it does not by itself justify comparing an error of $\epsilon$ in the mean with an error of $\epsilon$ in total variation distance. Hence, the following proposition provides a precise justification of the comparison: whenever there are two different actions whose true mean state-action values are within $\epsilon$, an approximation error of $\epsilon$ in total variation distance for the state-action value distribution of one of the actions is sufficient to reverse the order of the means.

Proposition 4.2. Fix a state $s$ and consider two actions $a, a'$. Let $\mathcal{X}(s, a)$ be the random variable distributed as the true state-action value distribution of $(s, a)$, and $\mathcal{X}(s, a')$ be the same for $(s, a')$. Suppose that $\mathbb{E}[\mathcal{X}(s, a)] = \mathbb{E}[\mathcal{X}(s, a')] + \epsilon$. Then there is a random variable $\mathcal{Y}$ such that $d_{TV}(\mathcal{Y}, \mathcal{X}(s, a)) \leq \epsilon$ and $\mathbb{E}[\mathcal{X}(s, a')] \geq \mathbb{E}[\mathcal{Y}]$.

Proof.
Let $\tau^* \in \mathbb{R}$ be the infimum $\tau^* = \inf\{\tau \in \mathbb{R} \mid \mathbb{P}[\mathcal{X}(s, a) \geq \tau] = \epsilon\}$, i.e. $\tau^*$ is the first point in $\mathbb{R}$ such that $\mathcal{X}(s, a)$ takes values at least $\tau^*$ with probability exactly $\epsilon$. Next let the random variable $\mathcal{Y}$ be defined by the following process. First, sample the random variable $\mathcal{X}(s, a)$. If $\mathcal{X}(s, a) \geq \tau^*$, then output $\tau^* - 1$. Otherwise, output the sampled value of $\mathcal{X}(s, a)$. Observe that the probability distributions of $\mathcal{Y}$ and $\mathcal{X}(s, a)$ are identical except at the point $\tau^* - 1$ and on the interval $[\tau^*, \infty)$. Let $\mu$ be the Lebesgue measure on $\mathbb{R}$. By construction of $\mathcal{Y}$ the total variation distance is given by

$$d_{TV}(\mathcal{Y}, \mathcal{X}) = \frac{1}{2} \int_{\mathbb{R}} \big| \mathbb{P}[\mathcal{X}(s, a) = z] - \mathbb{P}[\mathcal{Y} = z] \big| \, d\mu(z) = \frac{1}{2} \big| \mathbb{P}[\mathcal{X}(s, a) = \tau^* - 1] - \mathbb{P}[\mathcal{Y} = \tau^* - 1] \big| + \frac{1}{2} \int_{[\tau^*, \infty)} \big| \mathbb{P}[\mathcal{X}(s, a) = z] - \mathbb{P}[\mathcal{Y} = z] \big| \, d\mu(z) = \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$

Next note that the expectation of $\mathcal{Y}$ is given by

$$\mathbb{E}[\mathcal{Y}] = \epsilon(\tau^* - 1) + \int_{(-\infty, \tau^*)} z \, \mathbb{P}[\mathcal{X}(s, a) = z] \, d\mu(z) = \epsilon(\tau^* - 1) + \mathbb{E}[\mathcal{X}(s, a)] - \int_{[\tau^*, \infty)} z \, \mathbb{P}[\mathcal{X}(s, a) = z] \, d\mu(z) \leq \epsilon(\tau^* - 1) + \mathbb{E}[\mathcal{X}(s, a)] - \epsilon \tau^* = \mathbb{E}[\mathcal{X}(s, a)] - \epsilon$$

where the inequality follows from the fact that $\mathcal{X}(s, a)$ takes values at least $\tau^*$ with probability $\epsilon$. To summarize, Proposition 4.2 shows that, in the case where the mean state-action values are within $\epsilon$, unless the state-action value distribution is learned to within total variation distance $\epsilon$, the incorrect action may be selected by the distributional reinforcement learning policy. Therefore, it is natural to compare the sample complexity of learning the state-action value distribution to within total variation distance $\epsilon$ with the sample complexity of simply learning the mean to within error $\epsilon$, as is done in Proposition 4.1.
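The construction in the proof of Proposition 4.2 can be checked numerically on a small discrete example. The distribution below is ours and purely illustrative; the bound is tight here because all the moved mass sits exactly at $\tau^*$:

```python
import numpy as np

# Toy check of the Proposition 4.2 construction for a discrete X(s, a).
eps = 0.2
support = np.array([0.0, 1.0, 2.0, 3.0])
p_x = np.array([0.3, 0.3, 0.2, 0.2])     # P[X >= 3] = eps, so tau* = 3
mean_x = float(support @ p_x)

# Y moves the top-eps mass from tau* = 3 down to tau* - 1 = 2.
p_y = np.array([0.3, 0.3, 0.4, 0.0])
mean_y = float(support @ p_y)

tv = 0.5 * float(np.abs(p_x - p_y).sum())
assert np.isclose(tv, eps)               # d_TV(Y, X) = eps
assert np.isclose(mean_x - mean_y, eps)  # E[Y] = E[X] - eps
```

So a total variation perturbation of size $\epsilon$ is exactly enough to shift the mean down by $\epsilon$ and reverse the order of two actions whose means differ by $\epsilon$.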

5. SAMPLE COMPLEXITY WITH UNKNOWN SUPPORT

The setting considered in Proposition 4.1 most readily applies to the base distributional reinforcement learning algorithm C51, which attempts to directly learn a discrete distribution with known support in order to approximate the state-action value distribution. However, further advances in distributional reinforcement learning, including QR-DQN and IQN, do away with the assumption that the support of the distribution is known. This allows a more flexible representation that can more accurately represent the true distribution of state-action values but, as we will show, potentially leads to a further increase in the sample complexity. The QR-DQN algorithm models the distribution of state-action values as a uniform mixture of $N$ Dirac deltas on the reals, i.e. $Z(s, a) = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i(s, a)}$, where $\theta_i(s, a) \in \mathbb{R}$ is a parametric model.

Proposition 5.1. Let $N > M \geq 2$, $\epsilon > \frac{M}{4N}$, and $\theta_i \in \mathbb{R}$ for $i \in [N]$. The number of samples required to learn a distribution of the form $Z = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i}$ to within total variation distance $\epsilon$ is $\Omega\big( \frac{M}{\epsilon^2} \big)$.

Proof. Let $M \geq 2$ and $D = \{1, 2, \ldots, M\} \subseteq \mathbb{R}$. First we will show that any distribution $p(z)$ supported on $z \in D$ is within total variation distance $\frac{M}{4N}$ of the distribution of a random variable of the form $Z = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i}$ for numbers $\theta_i \in D$. Indeed, we can construct such a distribution as follows. First let $\hat{p}(z)$ be the rounded distribution obtained by rounding each probability $p(z)$ to the nearest integer multiple of $\frac{1}{N}$. The total variation distance between $p(z)$ and $\hat{p}(z)$ is bounded by

$$\frac{1}{2} \sum_{z=1}^M |p(z) - \hat{p}(z)| \leq \frac{1}{2} \sum_{z=1}^M \frac{1}{2N} \leq \frac{M}{4N}.$$

Next partition the set of $\theta_i$ into $M$ groups $G_1, G_2, \ldots, G_M$, where group $G_z$ has size $N \hat{p}(z)$ (this size is an integer by construction of $\hat{p}$). Finally, for each $\theta_i \in G_z$ assign $\theta_i = z$. Thus, for $Z = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i}$ we have, for each $z \in D$,

$$\mathbb{P}[Z = z] = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[\theta_i = z] = \frac{1}{N} |G_z| = \hat{p}(z).$$
Therefore, any distribution $p(z)$ supported on $D$ can be approximated to within total variation distance $\frac{M}{4N}$ by a distribution $Z$ of the prescribed form. Thus, by the sample complexity lower bounds for learning a discrete distribution with known support, for any $\epsilon > \frac{M}{4N}$ at least $\Omega\big( \frac{M}{\epsilon^2} \big)$ samples are required to learn a distribution of the form $Z = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i}$.

Depending on the choice of parameters, the lower bound in Proposition 5.1 can be significantly larger than that of Proposition 4.1. For example, if the desired approximation error is $\epsilon = \frac{1}{8}$, one can take $M = \frac{N}{2}$. In this case, if the value of $k$ in Proposition 4.1 satisfies $k = o(N)$, then the sample complexity in Proposition 5.1 is asymptotically larger than that of Proposition 4.1.
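The rounding step at the heart of the Proposition 5.1 proof can be sketched numerically. The function below is ours and slightly simplifies the construction: after rounding each probability to the nearest multiple of $1/N$, it dumps the leftover mass onto the largest entry so that the result is still a distribution, which stays within the $M/(4N)$ bound for this illustrative example:

```python
import numpy as np

def round_to_grid(p, n_atoms):
    """Round each probability to the nearest multiple of 1/N, then push
    the leftover mass onto the largest entry to renormalize (a
    simplification of the construction in the proof)."""
    q = np.round(p * n_atoms) / n_atoms
    q[np.argmax(q)] += 1.0 - q.sum()
    return q

M, N = 4, 64
p = np.array([0.10, 0.22, 0.35, 0.33])   # a distribution on D = {1, ..., M}
q = round_to_grid(p, N)

tv = 0.5 * float(np.abs(p - q).sum())
assert np.isclose(q.sum(), 1.0)
assert tv <= M / (4 * N)                 # the M/(4N) bound from the proof
```

Any such grid distribution is representable as a uniform mixture of $N$ Diracs by assigning $N \cdot q(z)$ of the atoms to each value $z$, which is exactly the QR-DQN representation from the text.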

6. LARGE SCALE EXPERIMENTAL INVESTIGATION

The experiments are conducted in the Arcade Learning Environment (ALE) (Bellemare et al., 2013). The Double Q-learning algorithm is trained via the Double Deep Q-Network (Hasselt et al., 2016a), initially proposed by van Hasselt (2010). The dueling algorithm is trained following Wang et al. (2016). The prior algorithm refers to the prioritized experience replay algorithm proposed by Schaul et al. (2016). The distributional reinforcement learning policies are trained via the C51 algorithm, IQN and QRDQN. To provide a complete picture of the sample complexity we conducted our experiments in both the low-data regime, i.e. the Arcade Learning Environment 100K benchmark, and the high-data regime, i.e. the baseline 200 million frame training. All of the results are reported with the standard error of the mean in all of the tables and figures in the paper. The experiments are run with JAX (Bradbury et al., 2018), with Haiku as the neural network library, Optax (Hessel et al., 2020) as the optimization library, and RLax as the reinforcement learning library (Babuschkin et al., 2020). More details on the hyperparameters and direct references to the implementations can be found in the appendix. Note that the human normalized score is computed as

$$\text{Score}_{\text{HN}} = \frac{\text{Score}_{\text{agent}} - \text{Score}_{\text{random}}}{\text{Score}_{\text{human}} - \text{Score}_{\text{random}}}.$$

Figure 1 reports learning curves for IQN, QRDQN, the dueling architecture and C51 for every MDP in the Arcade Learning Environment low-data regime 100K benchmark. These results demonstrate that the simple base dueling algorithm performs significantly better than any distributional algorithm when the training samples are limited. For a fair, direct and transparent comparison we kept the hyperparameters for the baseline algorithms in the low-data regime exactly the same as in the DRQ_ICLR paper (see appendix for the full list and the high-data regime hyperparameter settings). Note that the DRQ algorithm uses the dueling architecture without any distributional reinforcement learning.
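The human normalized score formula above is a one-liner; the function name and the example scores below are ours, purely for illustration:

```python
def human_normalized_score(agent, human, random):
    """Human normalized score: (agent - random) / (human - random).
    A score of 0 matches random play, 1 matches human play."""
    return (agent - random) / (human - random)

# An agent scoring halfway between random and human play gets 0.5.
score = human_normalized_score(agent=550.0, human=1000.0, random=100.0)  # -> 0.5
```

Aggregates such as the median, mean, and 20th percentile reported in Table 1 are then taken over these per-game normalized scores.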
One intriguing takeaway from the results provided in Table 1 and Figure 4 is the fact that the simple base dueling algorithm performs 15% better than the DRQ_NeurIPS implementation, and only 11% worse than the DRQ_ICLR implementation. Note that the original paper of the DRQ_ICLR algorithm provides a comparison only to data-efficient Rainbow (DER) (van Hasselt et al., 2019), which inherently uses distributional reinforcement learning. The fact that the original paper that proposed data augmentation for deep reinforcement learning (i.e. DRQ_ICLR) on top of the dueling architecture did not provide comparisons with the pure simple base dueling architecture (Wang et al., 2016) resulted in inflated performance profiles for the DRQ_ICLR algorithm. More intriguingly, the comparisons provided in the DRQ_ICLR paper to the DER and OTR algorithms report that the performance gained by DRQ_ICLR over DER is 82% and over OTR is 35%. However, Table 1 demonstrates that with the exact hyperparameters used in the DRQ_ICLR paper the performance gain is restricted to 11%. Moreover, when it is compared to the reproduced results of DRQ_NeurIPS, it turns out that there is a performance decrease from utilizing the DRQ algorithm over the dueling architecture. Here, DER_2021 refers to the re-implementation, with random seed variations, of the original data-efficient Rainbow paper (i.e. DER_2019) by van Hasselt et al. (2019); OTR refers to the further implementation of the Rainbow algorithm by Kielak (2019); and DRQ_NeurIPS refers to the re-implementation of the original DRQ algorithm, published in ICLR as a spotlight presentation, with the goal of achieving reproducibility with variation in the number of random seeds (Agarwal et al., 2021).
Thus, the fact that our paper provides foundations for the non-transferability of performance profile results from the large-data regime to the low-data regime can influence future research to build more concrete and accurate performance profiles for algorithm development in the low-data regime. Table 1 reports the human normalized median, human normalized mean, and human normalized 20th percentile results over all of the MDPs from the 100K Arcade Learning Environment benchmark for DQN, Double-Q, dueling, C51, QRDQN, IQN and prior. One important takeaway from the results reported in Table 1 is the fact that a particular algorithm's performance profile in 200 million frame training will not directly transfer to the low-data region. Figure 2 reports the learning curves of human normalized median, human normalized mean and human normalized 20th percentile for the dueling algorithm, C51, QRDQN, and IQN in the low-data region. These results once more demonstrate that the performance profile of the simple base dueling algorithm is significantly better than that of any distributional reinforcement learning algorithm when the number of environment interactions is limited. The left and center plots of Figure 3 report regret curves corresponding to the theoretical analysis in Proposition 3.2 for various choices of the feature dimensionality d and the intrinsic Bellman error I. In particular, the left plot shows the low-data regime, where the number of episodes K < 1000, while the center plot shows the high-data regime, where K is as large as 500000. Notably, the relative ordering of the regret across the different choices of d and I is completely reversed in the high-data regime when compared to the low-data regime.
Figure 4 reports results on the number of samples required for the baseline distributional reinforcement learning algorithm to reach the same performance levels achieved by the dueling algorithm for each individual MDP from the Arcade Learning Environment low-data regime benchmark. These results once more demonstrate that, to reach the same performance levels as the dueling algorithm, the baseline distributional reinforcement learning algorithm requires orders of magnitude more samples to train on. As discussed in Section 5, more complex representations for broader classes of distributions come at the cost of a higher sample complexity required for learning. One intriguing fact is that the original SimPLE paper provides a low-data regime comparison of its proposed algorithm with the Rainbow algorithm, which is essentially an algorithm designed in the high-data region, thereby making the implicit assumption that the state-of-the-art performance profile must transfer linearly to the low-data region. These instances of implicit assumptions also occur in DRQ_ICLR, CURL, SPR and EfficientZero, even when comparisons are made with more advanced algorithms such as MuZero.

7. CONCLUSION

In this paper we aimed to answer the following questions: (i) Do the performance profiles of deep reinforcement learning algorithms designed for certain data regimes translate approximately linearly to a different sample complexity region? (ii) What is the underlying theoretical relationship between performance profiles and sample complexity regimes? To answer these questions we provide a theoretical investigation of the sample complexity of baseline deep reinforcement learning algorithms. We conduct extensive experiments both in the low-data regime, the 100K Arcade Learning Environment benchmark, and in the high-data regime, the baseline 200 million frame training. Our results demonstrate that the performance profiles of deep reinforcement learning algorithms do not have a monotonic relationship across sample complexity regimes. The implicit assumption of a monotonic relationship between performance characteristics and sample complexity regions, present in many recent state-of-the-art studies, has been overly exploited. Thus, our results demonstrate that several baseline Q algorithms perform almost as well as recent variant algorithms that have been proposed and presented as the state-of-the-art.

A APPENDIX

A.1 RESULTS ON THE COMPLETE LIST OF GAMES FROM THE ARCADE LEARNING ENVIRONMENT 100K BASELINE

Table 2 reports the average scores obtained by the human player, a random player, the baseline Q-based dueling architecture, the baseline distributional reinforcement learning algorithm C51, QRDQN and IQN across all the games in the Arcade Learning Environment 100K baseline. These results once more demonstrate that the baseline Q-based algorithm performs significantly better than any distributional reinforcement learning algorithm, as has also been explained in detail in Section 5. The learning curves reported in Figure 5 demonstrate that the number of samples required to obtain the performance level achieved via the simple base dueling architecture is significantly higher for any distributional reinforcement learning algorithm. Note that the distributional reinforcement learning algorithm C51 represents the state-action value distribution as a discrete probability distribution supported on 51 fixed atoms evenly spaced between a pre-specified minimum and maximum value. In contrast, QR-DQN represents the value distribution as a uniform distribution over a larger number of atoms with variable positions on the real line. Thus, QR-DQN is able to more accurately approximate a broader class of state-action value distributions. Finally, IQN parameterizes the quantile function of the state-action value distribution via a deep neural network, leading to a yet more flexible representation of the state-action value distribution. As discussed in Section 5, more complex representations for broader classes of distributions come at the cost of a higher sample complexity required for learning. For a fair and transparent comparison, we kept the hyperparameters exactly the same as in the DRQ_ICLR paper for all of the baseline Q algorithms in the low-data region.
Note that DRQ is an observation regularization study; hence the hyperparameters in the DRQ paper are specifically tuned for the purpose of that paper, in addition to being tuned for the Arcade Learning Environment 100K low-data regime. We did not tune any of the hyperparameters for the baseline algorithms (i.e. the dueling architecture and the distributional reinforcement learning algorithms). Hence, it is quite possible that hyperparameter tuning would yield even better performance profile results for the simple baseline dueling architecture, and we would strongly encourage further research to conduct such hyperparameter optimization in the low-data regime. We have also tried the hyperparameter settings reported in the data-efficient Rainbow (DER) paper for C51, IQN and QRDQN in the low-data regime. The performance results for the hyperparameter settings of DER are provided in Table 4. As can be seen, the hyperparameter settings of DRQ_ICLR gave better performance results for C51, IQN and QRDQN in the low-data region as well. The results in Table 4 also align with the claims of the DER paper, in which extensive hyperparameter tuning was not conducted to achieve the reported results, and it is possible to obtain better results with further hyperparameter tuning.



Figure 1: The learning curves of Alien, Amidar, Asterix, BankHeist, ChopperCommand, Hero, CrazyClimber, JamesBond, Kangaroo, MsPacman, FrostBite, Qbert, RoadRunner, Seaquest and UpNDown with dueling architecture, C51, IQN and QRDQN algorithms in the Arcade Learning Environment with 100K environment interaction training. See appendix for the full learning curves.

Figure 2: Up: Human normalized median, mean and 20th percentile results for the dueling algorithm, C51, IQN and QRDQN in the Arcade Learning Environment 100K benchmark. Down: Human normalized median, mean, and 20th percentile results for the dueling algorithm, C51, IQN and QRDQN in the high-data regime towards 200 million frames.

Figure 3: Left: Regret in the low-data regime. Center: Regret in the high-data regime. Right: Distributional vs baseline Q comparison of algorithms that were proposed and developed in the high-data regime in the Arcade Learning Environment in both high-data regime and low-data regime.

Figure 5: The learning curves of Alien, Amidar, Asterix, BankHeist, BattleZone, Boxing, Breakout, ChopperCommand, Hero, CrazyClimber, JamesBond, Kangaroo, PrivateEye, MsPacman, FrostBite, Qbert, RoadRunner, Seaquest, Pong, Gopher, DemonAttack, Krull, and UpNDown with the dueling architecture (Wang et al., 2016), C51, IQN and QRDQN algorithms in the Arcade Learning Environment with 100K environment interaction training.



Table 2: Average returns for human, random, dueling (Wang et al., 2016), C51, QRDQN and IQN across all the games in the Arcade Learning Environment 100K benchmark.

Figure 5 reports the learning curves of the complete list of games in the Arcade Learning Environment 100K benchmark; in particular, for Alien, Amidar, Asterix, BankHeist, BattleZone, Boxing, Breakout, ChopperCommand, Hero, CrazyClimber, JamesBond, Kangaroo, PrivateEye, MsPacman, FrostBite, Qbert, RoadRunner, Seaquest, Pong, Gopher, DemonAttack, Krull, and UpNDown with the dueling architecture (Wang et al., 2016), C51, IQN and QRDQN algorithms with 100K environment interaction training.

Table 3: Hyperparameter settings and architectural details for the dueling algorithm, double-Q learning, C51, QRDQN, and IQN in the low-data regime of the Arcade Learning Environment.

Table 4: Human normalized mean, human normalized median, and human normalized 20th percentile results for the C51 algorithm, QRDQN, and IQN in the low-data regime of the Arcade Learning Environment with the hyperparameter settings reported in the DER paper.

A.3 MEAN ESTIMATION VERSUS LEARNING THE DISTRIBUTION

Proposition A.1 (Proposition 4.1). Let $\mathcal{X}$ be a real-valued random variable with support on exactly $k$ known values. Further, assume $|\mathcal{X}| < 1$ and let $\epsilon > 0$. Any algorithm that learns the distribution $\mathbb{P}(\mathcal{X})$ to within total variation distance $\epsilon$ requires $\Omega(k/\epsilon^2)$ samples, while there exists an algorithm to estimate $\mathbb{E}[\mathcal{X}]$ to within error $\epsilon$ using only $O(1/\epsilon^2)$ samples.

Proof. Learning a distribution with known discrete support of size $k$ requires $\Omega(k/\epsilon^2)$ samples to achieve total variation distance at most $\epsilon$ with constant probability (Canonne, 2020). On the other hand, let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be independent samples of the random variable $\mathcal{X}$ and consider the sample mean $\bar{\mathcal{X}} = \frac{1}{n} \sum_{i=1}^n \mathcal{X}_i$. The expectation is given by $\mathbb{E}[\bar{\mathcal{X}}] = \mathbb{E}[\mathcal{X}]$ and the variance is $\sigma^2(\bar{\mathcal{X}}) = \frac{1}{n} \sigma^2(\mathcal{X})$. Further, since $|\mathcal{X}| < 1$ we have that $\sigma^2(\mathcal{X}) < 1$ and so $\sigma^2(\bar{\mathcal{X}}) \leq \frac{1}{n}$. Hence, by Chebyshev's inequality,

$$\mathbb{P}\big[ |\bar{\mathcal{X}} - \mathbb{E}[\mathcal{X}]| \geq \epsilon \big] \leq \frac{\sigma^2(\bar{\mathcal{X}})}{\epsilon^2} \leq \frac{1}{n \epsilon^2}.$$

Thus with $n = O(\frac{1}{\epsilon^2})$ samples, $\bar{\mathcal{X}}$ is within $\epsilon$ of $\mathbb{E}[\mathcal{X}]$ with constant probability.
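The mean-estimation half of Proposition A.1 can be illustrated empirically: with $n = c/\epsilon^2$ samples of a bounded variable, the sample mean lands within $\epsilon$ of the true mean with high probability. The distribution, constant $c$, and seed below are ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
n = int(10 / eps**2)                  # n = c / eps^2 with c = 10

support = np.array([-0.5, 0.1, 0.8])  # |X| < 1, supported on k = 3 known values
probs = np.array([0.2, 0.5, 0.3])
true_mean = float(support @ probs)

samples = rng.choice(support, size=n, p=probs)
# With 4000 samples the standard deviation of the sample mean is far
# below eps, so the Chebyshev guarantee holds with room to spare.
assert abs(samples.mean() - true_mean) < eps
```

By contrast, Proposition A.1 says that estimating the full three-point distribution to total variation $\epsilon$ would cost a factor of $k$ more samples, which is the gap that grows with the richness of the distributional representation.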

