ADDRESSING EXTRAPOLATION ERROR IN DEEP OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) encompasses both online and offline regimes. Unlike its online counterpart, offline RL agents are trained using logged data only, without interaction with the environment. Therefore, offline RL is a promising direction for real-world applications, such as healthcare, where repeated interaction with environments is prohibitive. However, since offline RL losses often involve evaluating state-action pairs not well-covered by training data, they can suffer due to the errors introduced when the function approximator attempts to extrapolate those pairs' values. These errors can be compounded by bootstrapping when the function approximator overestimates, leading the value function to grow unbounded, thereby crippling learning. In this paper, we introduce a three-part solution to combat extrapolation errors: (i) behavior value estimation, (ii) ranking regularization, and (iii) reparametrization of the value function. We provide ample empirical evidence of the effectiveness of our method, showing state-of-the-art performance on the RL Unplugged (RLU) Atari dataset. Furthermore, we introduce new datasets for bsuite as well as partially observable DeepMind Lab environments, on which our method outperforms state-of-the-art offline RL algorithms.

1. INTRODUCTION

Agents are, fundamentally, entities which map observations to actions and can be trained with reinforcement learning (RL) in either an online or offline fashion. When trained online, an agent learns through trial and error by interacting with its environment. Online RL has had considerable success recently: on Atari (Mnih et al., 2015), the game of Go (Silver et al., 2017), video games like StarCraft II and Dota 2 (Vinyals et al., 2019; Berner et al., 2019), and robotics (Andrychowicz et al., 2020). However, the requirement of extensive environmental interaction combined with a need for exploratory behavior makes these algorithms unsuitable and potentially unsafe for many real-world applications. In contrast, in the offline setting (Fu et al., 2020; Fujimoto et al., 2018; Gulcehre et al., 2020; Levine et al., 2020), also known as batch RL (Ernst et al., 2005; Lange et al., 2012), agents learn from a fixed dataset which is assumed to have been logged by other (possibly unknown) agents. See Fig. 1 for an illustration of these two settings. Learning purely from logged data allows these algorithms to be more widely applicable, including in problems such as healthcare and self-driving cars, where repeated interaction with the environment is costly and potentially unsafe or unethical, and where logged historical data is abundant. However, these algorithms tend to behave considerably worse than their online counterparts.

Figure 1: In online RL (left), the agent must interact with the environment to gather data to learn from. In offline RL (right), the agent must learn from a logged dataset.

Although similar in principle, there are some important differences between the two regimes. While it is useful for online agents to explore unknown regions of the state space so as to gain knowledge about the environment and better their chances of finding a good policy (Schmidhuber, 1991), this is not the case for the offline setting. Choosing actions not well-represented in the dataset would force offline methods to rely on the function approximator's extrapolation ability. This can lead to substantial errors during training, as well as during deployment of the agent. During training, the extrapolation errors are exacerbated by bootstrapping and the use of max operators (e.g. in Q-learning), where evaluating the loss entails taking the maximum over noisy and possibly overestimated values of the different possible actions. This can result in a propagation of the erroneous values, leading to extreme overestimation of the value function and potentially unbounded error; see (Fujimoto et al., 2019b) and our remark in Appendix A. As we empirically show in Section 4.2, extrapolation errors are a different source of overestimation compared to those considered by standard methods such as Double DQN (Hasselt, 2010), and hence cannot be addressed by those approaches. In addition to extrapolation errors during training, a further degradation in performance can result from the use of greedy policies at test time which maximize over value estimates extrapolated to under-represented actions.

We propose a coherent set of techniques that work well together to combat extrapolation error and overestimation:

Behavior value estimation. First, we address extrapolation errors during training time. Instead of Q^{π*}, we estimate the value of the behavioral policy, Q^{π_B}, thereby avoiding the max-operator during training. To improve upon the behavioral policy, we conduct what amounts to a single step of policy improvement by employing a greedy policy at test time. Surprisingly, this technique with only one round of improvement allows us to perform significantly better than the behavioral policies and often outperform existing offline RL algorithms.

Ranking regularization.
We introduce a max-margin based regularizer that encourages the value function, represented as a deep neural network, to rank actions present in the observed rewarding episodes higher than any other actions. Intuitively, this regularizer pushes down the value of all unobserved state-action pairs, thereby minimizing the chance of a greedy policy selecting actions under-represented in the dataset. Employing the regularizer during training will minimize the impact of the max-operator used by the greedy policy at test time, i.e. this approach addresses extrapolation errors both at training and (indirectly) at test time.

Reparametrization of Q-values. While behavior value estimation typically performs well, particularly when combined with ranking regularization, it only allows for one iteration of policy improvement. When more data is available, and hence we can trust our function approximator to capture more of the structure of the state space and as a result generalize better, we can rely on Q-learning, which permits multiple policy improvement iterations. However, this exacerbates the overestimation issue. We propose, in addition to the ranking loss, a simple reparametrization of the value function to disentangle the scale from the relative ranks of the actions. This reparametrization allows us to introduce a regularization term on the scale of the value function alone, which reduces over-estimation.

To evaluate our proposed method, we introduce new datasets based on bsuite environments (Osband et al., 2019), as well as the partially observable DeepMind Lab environments (Beattie et al., 2016). We further evaluate our method as well as baselines on the RL Unplugged (RLU) Atari dataset (Gulcehre et al., 2020). We achieve a new state of the art (SOTA) performance on the RLU Atari dataset as well as outperform existing SOTA offline RL methods on our newly introduced datasets.
Last but not least, we provide careful ablations and analyses that provide insights into our proposed method as well as other existing offline RL algorithms.

Related work.

Early examples of offline/batch RL include least-squares temporal difference methods (Bradtke & Barto, 1996; Lagoudakis & Parr, 2003) and fitted Q iteration (Ernst et al., 2005; Riedmiller, 2005). Recently, Agarwal et al. (2019a), Fujimoto et al. (2019b), Kumar et al. (2019), Siegel et al. (2020), Wang et al. (2020) and Ghasemipour et al. (2020) have proposed offline RL algorithms and shown that they outperform off-the-shelf off-policy RL methods. There also exist methods explicitly addressing the issues stemming from extrapolation error (Fujimoto et al., 2019b).

2. BACKGROUND AND PROBLEM STATEMENT

We consider, in this work, Markov Decision Processes (MDPs) defined by (S, A, P, r, ρ_0, γ), where S is the set of all possible states and A the set of all possible actions. An agent starts in some state s_0 ∼ ρ_0(·), where ρ_0(·) is a distribution over S, and takes actions according to its policy a ∼ π(·|s), a ∈ A, when in state s. It then observes a new state s' and reward r according to the transition distribution P(s'|s, a) and reward function r(s, a). The state-action value function Q^π describes the expected discounted return starting from state s and action a and following π afterwards:

Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ],   where s_0 = s, a_0 = a, s_t ∼ P(·|s_{t−1}, a_{t−1}), a_t ∼ π(·|s_t),

and V^π(s) = E_{a∼π(·|s)}[Q^π(s, a)] is the state value function. The optimal policy π*, which we aim to discover through RL, is one that maximizes the expected cumulative discounted rewards, or expected returns, such that Q^{π*}(s, a) ≥ Q^π(s, a) ∀s, a, π. For notational simplicity, we denote the policy used to generate an offline dataset as π_B. In the same vein, for a state s in an offline dataset, we write G_B(s) to denote an empirical estimate of V^{π_B}(s), computed by summing future discounted rewards over the trajectory that s is part of.

Approaches to RL can be broadly categorized as either on-policy or off-policy algorithms. Whereas on-policy algorithms update their current policy based on data generated by that same policy, off-policy approaches can take advantage of data generated by other policies. Algorithms in the mold of fitted Q-iteration make up many of the most popular approaches to deep off-policy RL (Mnih et al., 2015; Lillicrap et al., 2015; Haarnoja et al., 2018). This class of algorithms learns a Q function by minimizing the Temporal Difference (TD) error. To increase stability and sample efficiency, experience replay is also typically employed.
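For concreteness, the empirical return G_B(s) for every state along a logged trajectory can be computed with a single backward pass over the rewards. The numpy sketch below is ours (the function name and signature are illustrative assumptions, not from the paper):

```python
import numpy as np

def empirical_returns(rewards, gamma=0.99):
    """Compute G_B(s_t) = sum_{k>=0} gamma^k * r_{t+k} for every state
    along one logged trajectory, using a backward recursion."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns
```

For example, with rewards [0, 0, 1] and γ = 0.5, the returns are [0.25, 0.5, 1.0].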
For example, DQN (Mnih et al., 2015) minimizes the following loss function:

L(θ) = E_{(s,a,r,s')∼D} [ (Q_θ(s, a) − (r + γ max_{a'} Q_{θ'}(s', a')))^2 ],    (2)

where D represents the experience replay, i.e. a dataset generated by some behavior policy, and θ' denotes the parameters of a periodically updated target network. Typically, for off-policy algorithms the behavior policy is periodically updated to remain close to the policy being optimized. A deterministic policy can be derived by being greedy with respect to Q, i.e. by defining π(s) = arg max_a Q(s, a). In cases where maximization is nontrivial (e.g. continuous action spaces), we typically adopt a separate policy π and optimize losses similar to:

L(θ) = E_{(s,a,r,s')∼D} [ (Q_θ(s, a) − (r + γ E_{a'∼π(·|s')}[Q_{θ'}(s', a')]))^2 ].

In this case, π is optimized separately in order to maximize E_{a∼π(·|s)}[Q(s, a)], sometimes subject to other constraints (Lillicrap et al., 2015; Haarnoja et al., 2018). Various extensions have been proposed for this class of algorithms, including but not limited to: distributional critics (Bellemare et al., 2017), prioritized replay (Schaul et al., 2015), and n-step returns (Kapturowski et al., 2019; Barth-Maron et al., 2018; Hessel et al., 2017).

In the offline RL setting (see Figure 1, right), agents learn from fixed datasets generated via other processes, thus rendering off-policy RL algorithms particularly pertinent. Many existing offline RL algorithms adopt variants of Equation (2) to learn value functions; e.g. Agarwal et al. (2019b). Offline RL, however, is different from off-policy learning in the online setting. The dataset used is finite and fixed, and does not track the policy being learned. When a policy moves towards a part of the state space not covered by the behavior policy (or policies), for example, one cannot effectively learn the value function. We will explore this in more detail in the next subsection.
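The Q-learning TD objective above can be sketched in a few lines of numpy. The function name and dict-based batch layout below are our own illustrative assumptions, not the paper's implementation; `q_target` stands for the periodically copied target network, which is not differentiated through:

```python
import numpy as np

def dqn_td_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared TD error of the Q-learning loss on a sampled batch.

    q_online, q_target: callables mapping a batch of states to an
    (N, |A|) array of Q-values; q_target plays the role of the frozen
    target network.
    batch: dict with arrays 's', 'a', 'r', 's_next'.
    """
    n = len(batch['a'])
    # Q_theta(s, a) for the actions actually taken in the batch.
    q_sa = q_online(batch['s'])[np.arange(n), batch['a']]
    # Bootstrapped target: r + gamma * max_a' Q_target(s', a').
    td_target = batch['r'] + gamma * q_target(batch['s_next']).max(axis=1)
    return np.mean((q_sa - td_target) ** 2)
```

In the offline setting, the max over `q_target(batch['s_next'])` runs over all actions, including ones the behavior policy never took in those states, which is exactly where extrapolation error enters.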

2.1. EXTRAPOLATION AND OVERESTIMATION IN OFFLINE RL

In the offline setting, when considering all possible actions for a next state in Equation (2), some of the actions will be out-of-distribution (OOD), i.e. these actions were never picked in that particular state by the behavior policy used to construct the training set (and hence are not present in the data). In such circumstances, we have to rely on the current Q-network's ability to extrapolate beyond the training data, resulting in extrapolation errors when evaluating the loss. Moreover, the need for extrapolation can lead to value overestimation, as explained below. Value overestimation (see Fig. 2) happens when the function approximator predicts a larger value than the ground truth. In short, taking the max over actions of several Q-network predictions, as in Equation (2), leads to overconfident estimates of the true value of the state. We will expand on this point shortly. Before doing so, it is worth pointing out that this phenomenon of overestimation is well-studied in the online setting (Van Hasselt et al., 2015; 2018) and some prior works have sought to address this problem (Van Hasselt et al., 2015; Fujimoto et al., 2018). However, in offline RL overestimation manifests itself in more problematic ways, which cannot be addressed by the solutions proposed in online RL (Kumar et al., 2019). To see this, let us consider Equation (2) again. The max operator is used to evaluate Q for all actions in a given state, including actions absent from the dataset (OOD actions). For OOD actions, we depend on extrapolated values provided by Q_{θ'}. While neural networks are an extremely powerful family of models, they will produce erroneous predictions on unobserved state-action pairs, and sometimes these will be artificially high. These errors will then be propagated into the values of other states via bootstrapping.
Due to the smoothness of neural networks, increasing the value of actions in an OOD state-action pair's neighborhood may increase the overestimated value itself, creating a vicious loop. In particular, we remark that, in such a scenario, typical gradient descent optimization can diverge and escape towards infinity. See Appendix A for a formal statement and a proof, though similar observations have been made before by Fujimoto et al. (2019b) and Achiam et al. (2019). In the online setting, when the agent overestimates some state-action pairs, they will be chosen more often due to the optimistic estimates of their values, even in the off-policy setting where the behavior policy trails the learned one. The online agent will then act, collect data, and thereby correct its extrapolation errors. This form of self-correction is absent in the offline setting, and due to the overestimation arising from extrapolation, this absence can be catastrophic.
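The upward bias of the max over noisy estimates can be seen in a small simulation (ours, not from the paper): even when estimation noise is zero-mean, taking the max over several noisy Q-estimates systematically overestimates the true value of the state.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0, 1.0])  # all four actions are equally good

# Zero-mean noise on the value estimates, standing in for extrapolation
# error on actions not covered by the dataset.
estimates = true_q + rng.normal(0.0, 0.5, size=(10_000, 4))

# E[max_a (Q + eps)] > max_a Q even though E[eps] = 0.
mean_max = estimates.max(axis=1).mean()
print(mean_max)  # noticeably above the true max of 1.0
```

With four actions and noise of standard deviation 0.5, the mean of the max sits around 1.5, a 50% overestimate of the true value.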

3. SOLUTIONS TO ADDRESS EXTRAPOLATION ERROR

We build towards a solution to extrapolation errors by i) using behavior value estimation to reduce training-time extrapolation error, ii) applying ranking regularization to the Q-networks to better handle test-time extrapolation error, and iii) reparameterizing the Q-function to prevent its predictions from diverging to infinity.

3.1. BEHAVIOR VALUE ESTIMATION

One potential answer to the overestimation problem is to remove the max-operator in the policy evaluation step by optimizing the alternative loss:

L(θ) = E_{(s,a,r,s',a')∼D} [ (Q_θ(s, a) − (r + γ Q_{θ'}(s', a')))^2 ].    (3)

This update rule relies on transitions (s, a, r, s', a') collected by the behavior policy π_B and resembles the policy evaluation step of SARSA (Rummery & Niranjan, 1994; Van Seijen et al., 2009). Since the update contains no max-operator, and Q_θ is evaluated only on state-action pairs that are part of the dataset, the learning process is not affected by overestimation. However, the removal of the max-operator means the update simply tries to evaluate the value of the behavioral policy. The astute reader may question our ability to improve upon the behavioral policy when using this update rule. We note that when acting using the greedy policy π(s) = arg max_a Q_θ(s, a) we are in fact performing a single policy improvement step. Fortunately, this one step is typically sufficient for dramatic gains, as we show in our experiments (see for example Fig. 9). This finding matches our understanding that policy iteration algorithms typically do not require more than a few steps to converge to the optimal policy (Lagoudakis & Parr, 2003; Sutton & Barto, 2018, Chapter 4.3).
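A minimal numpy sketch of this SARSA-style update (Eq. (3)) is shown below; the names and batch layout are our own illustrative assumptions. The only change relative to the Q-learning loss is that the bootstrap uses the logged next action a' rather than a max over all actions, so only in-distribution state-action pairs are ever evaluated:

```python
import numpy as np

def behavior_value_loss(q_online, q_target, batch, gamma=0.99):
    """Policy-evaluation loss of Eq. (3) on a sampled batch.

    batch: dict with arrays 's', 'a', 'r', 's_next', 'a_next', where
    'a_next' is the action the behavior policy actually took in s'.
    """
    n = len(batch['a'])
    q_sa = q_online(batch['s'])[np.arange(n), batch['a']]
    # Bootstrap on the logged next action instead of max over actions.
    q_next = q_target(batch['s_next'])[np.arange(n), batch['a_next']]
    td_target = batch['r'] + gamma * q_next
    return np.mean((q_sa - td_target) ** 2)
```

Policy improvement is then deferred to test time, where one acts greedily with respect to the learned Q-values.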

3.2. RANKING REGULARIZATION

Policy evaluation with Eq. (3) effectively reduces overestimation during training. But it also avoids learning the Q-values of OOD actions. Because they are never trained, these values are likely erroneous, and many will err on the side of overestimation, thus harming the greedy policy. This is in contrast with the tabular case, where all OOD actions would retain a default value of 0. To robustify the policy improvement step, a natural choice is to regularize the function approximator such that it behaves more predictably on unseen inputs. Forcing the neural network to output 0 for OOD actions might require very non-smooth behavior of the network; hence we choose a less harsh regularizer that asks the model only to assign lower values to state-action pairs that have not been observed during learning. We formulate this as a ranking loss which follows a typical hinge-loss approximation (Chen et al., 2009; Burges et al., 2005) for ranking problems. Given a transition (s_t, a_t) from the dataset, this can be formulated as

C(θ) = Σ_{i=0, i≠t}^{|A|} max(Q_θ(s_t, a_i) − Q_θ(s_t, a_t) + ν, 0)^2.    (4)

While Equation (4) does, in expectation, encourage lower ranks for OOD actions, it can also have the adverse effect of promoting suboptimal behavior that is frequent in the dataset. This is because, for any transition, proportionally to its frequency in the dataset, the regularizer pushes the value of all but the selected action down, promoting a policy that picks the selected action regardless of its value. To minimize this effect, we weigh the regularization based on the value of the trajectory:

C(θ) = exp( (G_B(s) − E_{s∼D}[G_B(s)]) / β ) Σ_{i=0, i≠t}^{|A|} max(Q_θ(s, a_i) − Q_θ(s, a_t) + ν, 0)^2,    (5)

where E_{s∼D}[G_B(s)] is estimated by averaging G_B(s) over mini-batches. In all our experiments, we fix ν = 5e-2 and β = 2.
This new formulation of the loss ensures that OOD actions rank lower than observed actions, particularly on trajectories that performed well in the dataset, i.e. trajectories that are likely under a policy learned using behavior value estimation. We note that our ranking loss, when viewed through the lens of on-policy online RL, can be related to ranking policy gradients (Lin & Zhou, 2020), or to (Su et al., 2020; Pohlen et al., 2018), who also used a hinge loss as a regularizer but with goals different from ours.
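The return-weighted ranking regularizer of Eq. (5) can be sketched as follows for a single transition (a numpy illustration with hypothetical names; in practice it would be averaged over a minibatch and added to the TD loss):

```python
import numpy as np

def ranking_regularizer(q_values, a_t, g_b, g_b_mean, nu=5e-2, beta=2.0):
    """Return-weighted hinge ranking loss of Eq. (5) for one transition.

    q_values: (|A|,) Q-values predicted for state s_t.
    a_t: index of the action logged in the dataset.
    g_b: empirical return G_B(s_t) of the trajectory.
    g_b_mean: minibatch estimate of E_{s~D}[G_B(s)].
    """
    # Weight transitions by how well their trajectory performed.
    weight = np.exp((g_b - g_b_mean) / beta)
    # Squared hinge: penalize any action ranked within margin nu of a_t.
    margins = np.maximum(q_values - q_values[a_t] + nu, 0.0) ** 2
    margins[a_t] = 0.0  # the observed action itself is excluded
    return weight * margins.sum()
```

For example, with Q-values [1.0, 2.0, 0.5], logged action 0, and a trajectory of average return, only action 1 violates the margin, contributing (1.0 + 0.05)^2.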

3.3. REPARAMETRIZATION OF Q-VALUES

The overestimation of state-action values can be severe in offline RL, and the values can escape towards infinity (see Appendix A). While behavior value estimation can be an effective way of suppressing this overestimation, when one iteration of policy improvement is insufficient, one may want to bring back the max-operator and therefore the implicit policy improvement step of Q-learning. To better handle this scenario, we introduce a complementary method that prevents severe over-estimation by bounding the values predicted by the critic via reparameterization. Specifically, we reparameterize the critic as Q_θ(s, a) = α Q̃_θ(s, a), given a state- and action-independent scale parameter α. This, in effect, disentangles the scale from the relative magnitude of the values predicted, and also enables us to impose constraints on the scale parameter. To further stabilize learning and reduce the variance of the estimates, we update α by stochastic gradient descent, but with larger minibatches and a smaller learning rate. In our formulation, the "standardized" value Q̃_θ(s, a) ∈ [−1, 1] is attained by using a tanh activation function. Note that the tanh activation has the side effect of reducing numerical resolution for representing extreme values (as the tanh will be in its saturated regime), minimizing the ability of the learning process to keep growing these values by bootstrapping on each other. We let α = exp(ρ) such that α > 0. Our parameterization thus ensures that Q-values are always bounded in absolute value by α, i.e. Q(s, a) ∈ [−α, α]. The equation below shows how critic scaling can be adapted into the Q-learning objective:

L(θ, α) = E_{(s,a,r,s')∼D} [ (α Q̃_θ(s, a) − (r + γ α' max_{a'} Q̃_{θ'}(s', a')))^2 ],    (6)

where θ' and α' are target copies of θ and α. The introduction of α allows us to conveniently regularize the scale of Q-values without disturbing the ranking between actions.
More precisely, we introduce a regularization term on α:

C(α) = E[ softplus(α Q̃_θ(s, a) − G_B(s))^2 ],    (7)

where C(α) represents a soft constraint requiring Q-values to stay close to the performance of the behavioral policy, thereby preventing gross overestimation. In Eq. (7), we rely on the softplus function so that α is constrained mainly when α Q̃_θ(s, a) > G_B(s).
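A minimal sketch of the reparametrized critic head and the scale penalty of Eq. (7) (numpy, with illustrative names; in practice `q_tilde` would be the pre-activation output of the critic network and ρ a learnable parameter trained with its own larger batches and smaller learning rate):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def reparam_q(q_tilde, rho):
    """Q(s, a) = alpha * tanh(q_tilde), with alpha = exp(rho) > 0.
    Every Q-value is therefore bounded in [-alpha, alpha]."""
    return np.exp(rho) * np.tanh(q_tilde)

def scale_penalty(q, g_b):
    """Soft constraint C(alpha) of Eq. (7): penalize the scale mainly
    when the critic's prediction exceeds the behavior return G_B(s)."""
    return np.mean(softplus(q - g_b) ** 2)
```

Even if the network pushes `q_tilde` towards very large values, the tanh saturates and the prediction can never exceed exp(rho), so bootstrapped targets cannot grow without bound.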

4. EXPERIMENTS

We investigate the performance of discrete offline RL algorithms on the three aforementioned open-source domains: Atari, DeepMind Lab, and bsuite. A question we are particularly interested in answering is: how does the lack of coverage of state-action pairs affect the performance of each algorithm? In that context, we study each algorithm's robustness to dataset size (see Fig. 6), noise (see Fig. 3 and 7), and reward distribution (Fig. 9 in Appendix), as they all affect the datasets' coverage of the state and action space. Because we explore various ablations of our proposed approach, discussed in Section 3, we use a specific acronym for each potential combination. According to our naming convention, Q stands for Q-learning and B for behavior value estimation as the underlying RL loss, R indicates the use of the ranking regularization, and r the use of reparametrization. In that vein, QRr refers to Q-learning with ranking regularization and reparametrization, and BR stands for behavior value estimation with ranking regularization (see Appendix B.1). We note that both our DQN and R2D2 experiments used Double Q-learning (Van Hasselt et al., 2015), but for our approaches (and ablations thereof) that rely on Q-learning, we used the vanilla Q-learning algorithm. More details for each experimental setup appear in Appendix D. We also provide more analysis and additional results in Appendix B. We used an open-source Atari offline RL dataset, which is part of the RL Unplugged (Gulcehre et al., 2020) benchmark suite. We have created two new offline RL datasets for bsuite and DeepMind Lab, which we will open source. The details of these datasets are provided in Appendix C.

4.1. BSUITE EXPERIMENTS

bsuite (Osband et al., 2019) is a benchmark designed to highlight key aspects of agent scalability such as exploration, memory, and credit assignment. We have generated low-coverage offline RL datasets for catch and cartpole as described by Agarwal et al.
(2019a) (see Appendix C.1 for details). In Fig. 3, we compare the performance of BRr and QRr with four baselines: DDQN (Hasselt, 2010), CQL (Kumar et al., 2020), REM (Agarwal et al., 2019a) and BCQ (Fujimoto et al., 2018). We consider two tasks, each in five versions defined by the amount of injected noise. The noise is injected into transitions by replacing the actions from an agent with a random action with probability ε. On the harder dataset (cartpole), BRr, the proposed method, outperforms all other approaches, showing the efficiency of our approach and its robustness to noise. Two other methods, QRr (proposed by us as an ablation of BRr) and CQL, also perform relatively well. The results for catch are similar, with the exception that BCQ also improves performance, which re-emphasises the importance of restricting behavior to stay close to the observed data. We have additional results on mountain car, where most algorithms behave well except DDQN (see Appendix D.4).
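The noise-injection procedure for building the dataset variants can be sketched as follows (an illustrative numpy snippet; the function name and seeding are our own assumptions):

```python
import numpy as np

def inject_action_noise(actions, num_actions, eps, seed=0):
    """Replace each logged action with a uniformly random one with
    probability eps (the noise-level sweep used in the bsuite datasets)."""
    rng = np.random.default_rng(seed)
    noisy = actions.copy()
    mask = rng.random(len(actions)) < eps
    noisy[mask] = rng.integers(0, num_actions, size=mask.sum())
    return noisy
```

Larger ε spreads the logged actions more uniformly over the action space, trading off trajectory quality for state-action coverage.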

4.2. ATARI EXPERIMENTS

Atari is an established online RL benchmark (Bellemare et al., 2013), which has recently attracted the attention of the offline RL community (Agarwal et al., 2019a; Fujimoto et al., 2019a), arguably because the diversity of games presents a challenge for offline RL methods. Here, we used the experimental protocol and datasets from the RL Unplugged Atari benchmark (Gulcehre et al., 2020). We report the median normalized score across the Atari games, and the error bars show a bootstrapped estimate of the [25, 75] percentile interval for the median estimate computed across different games. In Fig. 4, we show that QRr outperforms all baselines reported in the RL Unplugged benchmark as well as CQL (Kumar et al., 2020). While BRr performs well, this experiment highlights the potential limitation of doing a single policy improvement iteration in rich data regimes. Because in the considered setting there is enough data for the neural networks to learn reasonable approximations of the Q-values (exploiting the structure of the state space to extrapolate to unobserved state-action pairs), one can gain more by reverting to Q-learning in order to do multiple policy improvement steps. However, this amplifies the role of the regularization and in particular the reparametrization. Therefore, in this setting, QRr, which we proposed as an ablation of BRr, outperforms other techniques.

Figure 4: We show various ablation studies (in terms of using regularization, reparametrization and behavior value estimation). We found the most significant improvement from the ranking regularization term, although the combination of ranking regularization and reparameterization performs best.

Ablation Experiments on Atari. We ablate three different aspects of our algorithm on online policy selection games: i) the choice of TD backup updates (Q-learning or behavior value estimation), ii) the effect of ranking regularization, and iii) the reparameterization of the critic.
We show the ablation of these three components in Fig. 5. We observed the largest improvement when using ranking regularization. In general, we found that estimating the Monte-Carlo returns directly with the value function (referred to in our plots as "MC Learning") does not work on Atari. Behavior value estimation and Q-learning have similar performance on the full dataset; however, in low data regimes, behavior value estimation considerably outperforms Q-learning (see Fig. 9 in Appendix B).

Overestimation Experiments. Q-learning can over-estimate due to the maximization bias introduced by the max-operator in the backups (Hasselt, 2010). In the offline setting, another source of overestimation, as discussed in Section 2, is OOD actions due to the dataset's limited coverage. Double DQN (DDQN; Hasselt (2010)) is supposed to address the first problem, but it is unclear whether it can address the second. In Fig. 6, we show that in the offline setting DDQN still over-estimates severely when we evaluate the critic's predictions in the environment. We believe this is because the second factor, which is not explicitly addressed by DDQN, is the main source of overestimation. In contrast, Qr (vanilla Q-learning with reparametrization) and B are not affected by the reduced dataset size and coverage as much. In the figure, we compute the over-estimation error as (1/100) Σ_{i=1}^{100} max(Q^π(s, a) − G^π(s), 0)^2 over 100 episodes, where G^π(s) corresponds to the discounted sum of rewards from state s until the end of the episode when following the policy π.

Robustness Experiments. In Appendix B.3 (see Figure 9), we investigate the robustness of B and DDQN with respect to the reward distribution and dataset sizes. We found that the performance of B is more robust than that of DDQN to variations in the reward distribution and the dataset size.
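The over-estimation metric above can be computed directly (a numpy sketch; the function name is ours). Note that only the positive part of the gap is penalized, so underestimation does not contribute:

```python
import numpy as np

def overestimation_error(q_pred, g_actual):
    """Mean squared positive gap between the critic's prediction Q(s, a)
    and the realized discounted return G(s), averaged over (e.g. 100)
    evaluation episodes."""
    gap = np.maximum(np.asarray(q_pred) - np.asarray(g_actual), 0.0)
    return np.mean(gap ** 2)
```

For example, predictions [2, 1] against realized returns [1, 3] give gaps [1, 0] and an error of 0.5.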

4.3. DEEPMIND LAB EXPERIMENTS

Offline RL research has mainly focused on fully observable environments such as Atari. However, in a complex partially observable environment such as DeepMind Lab, it is very difficult to obtain good coverage in the dataset even after collecting billions of transitions. To highlight this, we have generated datasets by training an online R2D2 agent on DeepMind Lab levels. Specifically, we have generated datasets for four of the levels: explore object rewards many, explore object rewards few, rooms watermaze, and rooms select nonmatching object. The details of the datasets are provided in Appendix C.2. We compare offline R2D2, CQL, BC, B and BR on our DeepMind Lab datasets. In contrast to Atari, BR performed better than QR according to our preliminary results; thus, here, we decided to focus only on BR. We use the same network architecture for all methods, and hence the models vary only in the loss function. We want to compare our baselines' performance on our DeepMind Lab datasets when there is a large amount of data stored during online training with online R2D2. In Figure 7, we show the performance of each algorithm on the different levels. Our proposed modifications, BR and B, outperform other offline RL approaches on all DeepMind Lab levels. We argue that the poor performance of R2D2 in the offline setting is due to the low coverage of the dataset. Despite having on the order of 300M transitions, since the environment is partially observable and diverse, this is still not enough to adequately cover all possible state-action pairs. We present further results on dataset coverage on the DeepMind Lab seekavoid arena 01 level, with a dataset generated by a fixed policy, in Appendix B.2, where we show that BR is more robust to the dataset coverage than other offline RL methods.

5. DISCUSSION

In this work, we first highlight how, in the offline deep RL setting, overestimation errors may cause Q-learning to diverge, with weights and Q-values escaping towards infinity. We discuss using behavior value estimation to address this problem, which efficiently regresses to the Q-values of the behavior policy and then takes a policy improvement step at test time by acting greedily with respect to the learnt Q-values. Behavior value estimation sidesteps the overestimation issue by avoiding the max-operator during training. We note that a single policy improvement step seems sufficient, especially in the low data regime, to improve over the behavior policy and the policy discovered by Double DQN. However, the max-operator used to construct the test-time policy re-introduces the overestimation errors that were avoided during training. We address this issue by regularizing the function approximator with a ranking loss that encourages OOD actions to rank lower than the observed actions. This reduces overestimation at test time and improves performance. Nevertheless, we observe that behavior value estimation can be too conservative in rich data settings. In such scenarios, the function approximator can exploit more of the underlying structure of the state and action space, leading to more reliable extrapolation. It can therefore be advantageous to rely on Q-learning, which can do multiple policy improvement steps, while further constraining the function approximator. The resulting algorithm QRr, that is, Q-learning with the ranking loss and reparametrization, outperforms all other approaches on the RL Unplugged Atari benchmark. Overall, behavior value estimation coupled with the ranking loss is an effective algorithm for low data regimes. For larger data regimes, where the coverage is better, it is possible to achieve better performance by switching to Q-learning and using reparametrization.
The proposed methods outperform existing offline RL approaches on the considered benchmarks. As future work, we plan to extend our observations to the continuous control setup and towards more real-world applications.

A Q-LEARNING CAN ESCAPE TO INFINITY IN THE OFFLINE CASE

Remark 1. Q-learning, using neural networks as function approximators, can diverge in the offline RL setting when the collected dataset does not include all possible state-action pairs, even if it contains all transitions along optimal paths. Furthermore, the parameters (and hence the Q-values themselves) can escape towards infinity under gradient descent dynamics.

Proof. The proof relies on providing a particular instance where Q-learning diverges towards infinity. This is sufficient to show that divergence can happen. Note that the remark does not make any statement about how likely this is to happen, nor does it provide sufficient conditions under which such divergence must happen. Let us consider the simple deterministic MDP depicted in the figure below (left).

[Figure: depiction of the MDP (left) and of the MLP used as the Q-function (right).]

S = {s_1, s_2, s_3, s_4} is the set of all states, where s_1 is deterministically the starting state and s_4 is the terminal state of the MDP. Let A = {a_0, a_1, a_2} be the set of all possible actions. Let the reward function r(s, a) be 0 for all state-action pairs except r(s_1, a_0), which is 1. Let the transition probabilities P(s'|s, a) be deterministic as defined by the depicted arrows, i.e. for any state-action pair, transitioning to exactly one state has probability 1, while the rest have probability 0. For example, only P(s_4|s_1, a_0) = 1, while P(s_3|s_1, a_0) = 0 and P(s_2|s_1, a_0) = 0; for (s_1, a_1), only P(s_2|s_1, a_1) = 1; and so on and so forth. A first observation is that the optimal behavior is to pick action a_0 (as it is the only rewarding transition in the entire MDP). The features describing each state are given by a single real number, where s_1 = 0, s_2 = 1, s_3 = β, with β > 1/γ > 0, where γ is the discount factor.
Assume actions are provided to the neural network as one-hot vectors, i.e. a_0 = [1, 0, 0]^T, a_1 = [0, 1, 0]^T, a_2 = [0, 0, 1]^T (footnote 2), where we refer to a[i] as the i-th element of the vector that represents the action a. For example, a_0[0] = 1 and a_0[2] = 0. Let us consider the Q-function parametrized as a simple MLP (depicted in the figure above, right). The MLP uses rectifier activations and receives as input both the state and action, returning a single scalar value which is the Q-value for that particular state-action combination. Rewriting the diagram in analytical form, we have, for s ∈ R and a ∈ R^3:

Q_θ(s, a) = w · relu(s) + u_1 relu(a[0] − 2s) + u_2 relu(a[1] − 2s) + u_3 relu(−2s − a[2])    (8)

A note on initialization. The weights of the first layer are given as constants. The argument would still go through if we left them learnable as well, but the analysis would become considerably harder. The exact values used, −2, 1, −1, are not important. In principle we require the negative weights connecting s to h_2, h_3, h_4 to be larger in magnitude than those from a[i] to h_i, and we require the weight between a[2] and h_4 to be negative. They can be scaled arbitrarily small and do not need to be identical. What we rely on in the rest of the analysis is that the preactivations of h_2, h_3, h_4 are negative for states s_2 and s_3: they then fall in the zero region of the rectifier, meaning no gradient flows through those units. Since s_3 > s_2 ≥ 1 and a[i] ∈ {0, 1}, it is sufficient for the weights from s to h_2, h_3, h_4 to be larger in magnitude than the weights from a[i] to h_2, h_3, h_4. This ensures that for s ≥ 1, the Q-function is not a function of u_i, as u_i gets multiplied by 0 (footnote 3). We also want the function to never depend on u_3, to simplify our analysis, which is easily achieved if the weight going from a[2] to h_4 is negative.
Given the observations above, if we plug the different values of s_i and a_i into the formula, we get:

Q_θ(s_1, a_0) = u_1,  Q_θ(s_1, a_1) = u_2,  Q_θ(s_1, a_2) = 0,
∀a ∈ A: Q_θ(s_2, a) = w,  ∀a ∈ A: Q_θ(s_3, a) = βw.    (9)

Note that this implies

max_a Q_θ(s_2, a) = w,  max_a Q_θ(s_3, a) = βw.    (10)

Assume w > 0, and let the dataset collected by the behavior policy contain the following 3 transitions:

D = {(s_1, a_0, 1, s_4), (s_1, a_1, 0, s_2), (s_2, a_2, 0, s_3)}.

We can now construct the Q-learning loss used to learn the function Q in the offline case:

L = Σ_{(s,a,r,s') ∈ D} (Q_θ(s, a) − r − γ max_{a'} Q_{θ'}(s', a'))²
  = (Q_θ(s_1, a_0) − 1)² + (Q_θ(s_1, a_1) − γ max_a Q_{θ'}(s_2, a))² + (Q_θ(s_2, a_2) − γ max_a Q_{θ'}(s_3, a))²
  = (u_1 − 1)² + (u_2 − γw')² + (w − γβw')².    (11)

Note that we relied on Eq. (10) to evaluate the max operator, and that θ' is a copy of θ used for bootstrapping. This is the standard definition of Q-learning, see Eq. (2). In particular, in this toy example θ' is numerically always identical to θ (in general it can be a trailing copy of θ from k steps back) and is used mainly to indicate that when we take the derivative of the loss with respect to θ, we do not differentiate through Q_{θ'}. From Eq. (11) we notice that only the first transition in the dataset contributes to the gradient of u_1, only the second transition contributes to the gradient of u_2, and only the third transition contributes to the gradient of w. We can now evaluate the gradient of the loss L over the entire dataset with respect to θ:

∇_{u_1} L = u_1 − 1
∇_{u_2} L = u_2 − γw'
∇_w L = w − γβw' = (1 − γβ)w

Note that we assumed w > 0 and, for simplicity, exploited that w' = w numerically, so as to better understand the dynamics of the update. Given that β > 1/γ, ∇_w L will always be negative as long as w (and implicitly w') stays positive.
Given that w_t = w_{t−1} − α∇_w L for some learning rate α > 0, the update creates a vicious loop that increases the norm of w at every iteration, such that lim_{t→∞} w_t = ∞. Given that the gradient on u_2 tracks γw, the path that takes action a_1 in the initial state s_1 will have +∞ as its value. Note that all transitions along the optimal path of this deterministic MDP are part of the dataset. Also, given our example, the same happens if we rely on SGD rather than batch GD (as the different examples affect different parameters of the model independently, and there is no effect from averaging). Preconditioning the updates (as done, e.g., by Adam or RMSProp) will also not change the result, as it does not affect the sign of the gradient (the preconditioning matrix needs to be positive definite). Nor will momentum affect the divergence of learning, as it does not affect the sign of the update. This means that on the provided MDP, learning will diverge towards infinity under most commonly used gradient-based algorithms.
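The divergence can be checked numerically. Below is a minimal sketch (not the authors' code) that runs gradient descent on the loss in Eq. (11), using the hand-derived gradients and the one-hot convention a_0 = [1, 0, 0], a_1 = [0, 1, 0], a_2 = [0, 0, 1]; the choice γ = 0.9, β = 2 (so that β > 1/γ) is one arbitrary instance.

```python
def relu(x):
    return max(x, 0.0)

def q_value(w, u1, u2, u3, s, a):
    # Eq. (8): Q(s,a) = w*relu(s) + u1*relu(a[0]-2s) + u2*relu(a[1]-2s) + u3*relu(-2s-a[2])
    return (w * relu(s) + u1 * relu(a[0] - 2 * s)
            + u2 * relu(a[1] - 2 * s) + u3 * relu(-2 * s - a[2]))

gamma, beta = 0.9, 2.0            # beta > 1/gamma, as the proof requires
w, u1, u2, u3 = 0.1, 0.0, 0.0, 0.0
lr = 0.01                         # learning rate

for _ in range(2000):
    w_target = w                  # theta' is numerically identical to theta
    # Gradients of the three squared TD errors in Eq. (11):
    g_u1 = 2 * (u1 - 1)                        # from (u1 - 1)^2
    g_u2 = 2 * (u2 - gamma * w_target)         # max_a Q'(s2, a) = w'
    g_w = 2 * (w - gamma * beta * w_target)    # max_a Q'(s3, a) = beta * w'
    u1, u2, w = u1 - lr * g_u1, u2 - lr * g_u2, w - lr * g_w

print(u1)                                       # converges to the true value 1
print(w)                                        # escapes towards infinity
print(q_value(w, u1, u2, u3, 0.0, (0, 1, 0)))   # Q(s1, a1) = u2, also explodes
```

Running this shows u_1 settling at its correct value while w (and with it u_2) grows without bound, exactly as the analysis predicts.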

B ADDITIONAL RESULTS AND ABLATIONS

B.1 ACRONYMS

In Table 1, we provide the acronyms for our models and their corresponding meanings.

B.2 DEEPMIND LAB: EFFECT OF COVERAGE

As mentioned in Section 4.3, we investigate the effect of coverage on the DeepMind Lab seekavoid arena 01 level. To do so, we created another set of datasets, generated by evaluating a fixed, trained R2D2 snapshot in the environment with different noise levels: we used different ε's in the ε-greedy algorithm and stored each episode in the dataset. Increasing ε increases the coverage of the dataset, but it also increases the noise in the dataset. Since the environment is deterministic, a deterministic policy (ε = 0) corresponds to having only a single unique episode in the dataset; as we increase ε, the coverage and the diversity of trajectories in the dataset increase as well. We compare R2D2, CQL, BC, B and BR on these DeepMind Lab datasets using the same network architecture; the only change among the models is the loss function. We trained all models by unrolling over whole episodes using back-propagation through time. In Figure 8, we show the effect on this simple DeepMind Lab level. When ε = 0, BC outperforms the offline RL approaches; however, increasing the level of noise deteriorates the performance of BC, and BR starts to perform better.

B.3 ATARI: ROBUSTNESS TO DATA

Robustness to the reward distribution in the dataset is an important feature required to deploy offline RL algorithms in the real world, and we would like to understand the robustness of behavior value estimation in the offline RL setting. Thus, we investigate the robustness of B, in contrast to Q-learning, with respect to the dataset's size and reward distribution. In Figure 9, we split the dataset of each Atari online policy selection game into two smaller datasets: (i) transitions coming only from episodes whose episodic return is below the mean episodic return in the dataset ("Episodic Reward < Mean"), and (ii) transitions coming only from episodes with return above the mean ("Episodic Reward > Mean"). B outperforms DQN in both settings. Q-learning suffers more because the coverage of the dataset is reduced by the subsampling, which causes more severe extrapolation error.
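The split described above amounts to binning whole episodes by their return relative to the dataset mean. The following is an illustrative sketch (the data layout is hypothetical, not the actual pipeline):

```python
def split_by_episodic_return(episodes):
    """Split transitions by episodic return relative to the dataset mean.

    episodes: list of episodes, each a list of (s, a, r, next_s) transitions.
    Returns (below_mean_transitions, at_or_above_mean_transitions).
    """
    returns = [sum(r for (_, _, r, _) in ep) for ep in episodes]
    mean_return = sum(returns) / len(returns)
    below = [t for ep, g in zip(episodes, returns) if g < mean_return for t in ep]
    above = [t for ep, g in zip(episodes, returns) if g >= mean_return for t in ep]
    return below, above
```

Each resulting bin is then used as a standalone training dataset for the compared agents.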

B.4 ON THE EFFECT OF REGULARIZATION

In this section, we study the effect of the regularization on the action gap and the overestimation error. In Figure 10, we show that increasing the coefficient of the ranking regularization increases the action gap across the Atari online policy selection games, which can result in lower estimation error and better optimization. In Figure 11, we show the effect of increasing the regularization on the overestimation of the Q-network when evaluated in the environment; we report the mean overestimation across the games.

B.5 ONLINE POLICY SELECTION GAMES RESULTS

In Figure 12, we show the performance of different models with respect to the rewards they achieve over the course of training.

B.6 OVERESTIMATION ON ONLINE POLICY SELECTION GAMES

In Figures 13 and 14, we report the value error and the squared value error, respectively, of B, BRr and DQN.

C DETAILS OF DATASETS

C.1 BSUITE DATASET

BSuite (Osband et al., 2019) data was collected by training DQN agents (Mnih et al., 2015) with the default settings in Acme (Hoffman et al., 2020) from scratch on each of three tasks: cartpole, catch, and mountain car. We converted the originally deterministic environments into stochastic ones by randomly replacing the agent's action with a uniformly sampled action with probability ε ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} (i.e. ε = 0 corresponds to the original environment). We trained agents (separately for each randomness level and 5 seeds, i.e. 25 agents per game) for 1000, 2000, and 500 episodes on cartpole, catch, and mountain car, respectively. The number of episodes was chosen so that agents at all noise levels can reach their best performance. We recorded all the experience generated through the training process. Then, to reduce the coverage of the datasets and make them more challenging, we kept only 10% of the data by subsampling. More details of the datasets are provided in Table 2. The results presented in the paper are averaged over the 5 random seeds.
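The action-randomization step can be sketched as a thin wrapper around the environment. This is an illustrative sketch, not the released dataset-generation code; the `step` interface is an assumption:

```python
import random

class RandomActionWrapper:
    """With probability eps, replace the agent's action with a uniform one.

    Turns a deterministic discrete-action environment into a stochastic one,
    as done for the bsuite datasets (eps = 0 recovers the original environment).
    """

    def __init__(self, env, num_actions, eps, seed=0):
        self._env = env                    # assumed to expose step(action)
        self._num_actions = num_actions
        self._eps = eps
        self._rng = random.Random(seed)

    def step(self, action):
        if self._rng.random() < self._eps:
            action = self._rng.randrange(self._num_actions)
        return self._env.step(action)
```

The same wrapper is applied, with a different ε per dataset, while the DQN agent is trained and its experience logged.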

C.2 DEEPMIND LAB DATASET

DeepMind Lab (Beattie et al., 2016) data was collected by training distributed R2D2 (Kapturowski et al., 2019) agents from scratch on individual tasks. First, we tuned the hyperparameters of a distributed version of the Acme (Hoffman et al., 2020) R2D2 agent independently for every task to achieve fast learning in terms of actor steps. Then, we recorded the experience across all actors during entire training runs, a few times for every task. Training was stopped after there was no further progress in learning across all runs, with a resulting number of steps for each run between 50 million for the easiest task (seekavoid arena 01) and 200 million for some of the hard tasks. Finally, we built a separate offline RL dataset for every run and every task. See more details about these datasets in Table 3. Additionally, for the seekavoid arena 01 task we ran two fully trained snapshots of our R2D2 agents in the environment with different levels of noise (ε = 0, 0.01, 0.1, 0.25 for ε-greedy action selection). We recorded all interactions with the environment and generated a different offline RL dataset, containing 10 million actor steps, for every agent and every value of ε.

D EXPERIMENT DETAILS

We used the Adam optimizer (Kingma & Ba, 2014) for all our experiments. For details on the hyperparameters used, refer to Table 4 for bsuite, Table 5 for Atari, and Table 6 for DeepMind Lab. Our evaluation protocol is described below, in Section D.1. In the Atari experiments, we normalized the agents' scores as described in Gulcehre et al. (2020). On Atari, in all our experiments we report the median normalized score along with bootstrap estimates of the 75th and 25th percentiles, shown as interquartile-range error bars, as done by Gulcehre et al. (2020).

Atari Hyperparameters: On Atari, we directly used the baselines and the hyperparameters reported in Gulcehre et al. (2020); to obtain the detailed Atari results on the test set, we communicated with the authors. We ran additional CQL models and our own models with ranking regularization and reparameterization. For CQL, we fine-tuned both the learning rate, from the grid {8e−5, 1e−4, 3e−4}, and the regularization hyperparameter α ∈ {0.005, 0.01, 0.05, 0.1, 1}. For our own proposed models, we tuned only the learning rate, from the grid {8e−5, 1e−4, 3e−4}, and the ranking regularization hyperparameter, from the grid {0.005, 0.01, 0.05, 0.1, 1}. We fixed the rest of the hyperparameters. As mentioned earlier, we used only the online policy selection games for tuning. As a result of our grid search, we used a learning rate of 1e−4 for CQL and for our models, 0.01 for the α hyperparameter of CQL, and 0.05 for the ranking regularization hyperparameter, which seemed to be the optimal choice.

DeepMind Lab Hyperparameters: In the DeepMind Lab experiments, we tuned the hyperparameters of each model individually on each level. We tuned the learning rate and the regularization hyperparameters for each model from the same grids that we used for Atari.
All our algorithms are n-step in DeepMind Lab experiments, where n is fixed to 5 in all our experiments. Thus both behavior value estimation and Q-learning experiments use 5 steps of unrolls for learning.

D.1 EVALUATION PROTOCOL

To evaluate the performance of the various methods, we use the following protocol:
1. We sweep over a small set (5-10) of hyperparameter values for each of the methods.
2. We independently train each of the models on 5 datasets generated by running the behavior policy with 5 different seeds (i.e. producing 25-50 runs per problem setting and method).
3. We evaluate the produced models in the original environments (without the noise).
4. We average the results over seeds and report the results of the best hyperparameter for each method.
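Step 4 of the protocol amounts to the following simple reduction (the data layout and names here are illustrative, not the paper's code):

```python
def select_best_hyperparameter(results):
    """Average per-seed scores, then pick the best hyperparameter setting.

    results: dict mapping hyperparameter setting -> list of per-seed scores.
    Returns (best_setting, its mean score across seeds).
    """
    mean_scores = {h: sum(s) / len(s) for h, s in results.items()}
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores[best]
```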

D.1.1 EVALUATION METHOD

To evaluate models (step 3 above), in the case of bsuite and DeepMind Lab we ran an evaluation job in parallel to the training job. It repeatedly read the learner's checkpoint and produced evaluation results during training. We report the average of the evaluation scores over the last 100 learning steps. In the case of the Atari environments, instead of averaging performance during the final steps of learning, we take the final snapshot produced by a given method and evaluate it over 100 environment steps after training finished.

D.4 BSUITE DETAILED RESULTS

We generated datasets and performed experiments analogous to those in Section 4.1 for the mountain car environment. We present results for all three environments in Table 9. BRr outperforms all the baselines.

E REPARAMETRIZING THE Q-NETWORK

In all reparameterized critic experiments, we used the tanh(·) activation function with refine gates to help with optimization (Gu et al., 2019). As seen in Algorithm 1, there are two stages of optimization, updating the parameters θ of the Q-network and the scale α of the Q-values. They use different learning rates, and it is important to update α with the smaller one: η_2 ≤ η_1.
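A minimal sketch of the two-stage update follows. The linear features, plain squared TD error, and a bare tanh (without refine gates) are simplifying assumptions for illustration, not the paper's architecture; the key point is that θ and α take separate gradient steps with η_2 ≤ η_1.

```python
import numpy as np

def reparam_update(theta, alpha, batch, eta1, eta2, gamma=0.99):
    """One step of the two-stage update: Q(s, a) = alpha * tanh(theta . phi).

    batch: list of (phi, r, phi_next) feature tuples; the bootstrap target
    uses a stop-gradient (frozen) copy of both theta and alpha.
    """
    g_theta = np.zeros_like(theta)
    g_alpha = 0.0
    for phi, r, phi_next in batch:
        q_hat = np.tanh(theta @ phi)
        target = r + gamma * alpha * np.tanh(theta @ phi_next)  # frozen copy
        err = alpha * q_hat - target
        g_theta += 2 * err * alpha * (1 - q_hat ** 2) * phi     # d(err^2)/d(theta)
        g_alpha += 2 * err * q_hat                              # d(err^2)/d(alpha)
    theta = theta - eta1 * g_theta / len(batch)                 # larger step for theta
    alpha = alpha - eta2 * g_alpha / len(batch)                 # smaller step for alpha
    return theta, alpha
```

Because tanh bounds the network output to (−1, 1), the scalar α carries the overall magnitude of the Q-values, which is why its learning rate is kept smaller.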

F RANKING REGULARIZER

We propose a family of methods that prevent extrapolation error by suppressing the values of actions that are not in the dataset. We achieve this by ranking the actions in the training set higher than those that are not. For the learned Q-function, the absolute values of actions do not matter; we are rather interested in the relative ranking of the actions. Let a_t be the action from the dataset. For all j ≠ t, and for illustration purposes, the value-iteration target can be decomposed into a term involving Q(s, a_t) and an irreducible noise term ξ: it is irreducible because we cannot gather additional data on (s_t, a_j), and we do not know the corresponding reward for it. This causes extrapolation error, which accumulates through bootstrapping in the backups, as noted by Kumar et al. (2019). We implicitly pull down P(Q(s, a_t) ≯ Q(s, a_j)) by ranking the actions in the dataset higher, which pushes up P(Q(s, a_t) > Q(s, a_j)). As a result, the extrapolation error in Q-learning is also reduced.

F.1 PAIRWISE RANKING LOSS FOR Q-LEARNING

In this section, we discuss the relationship between the pairwise ranking loss for Q-learning and list-wise ranking losses. A common approximation (Chen et al., 2009; Burges et al., 2005) to the softplus-based log-likelihood is a hinge loss. It is also possible to derive the formulation that we use for the ranking regularizer from the policy gradient theorem: the Ranking Policy Gradient Theorem formulates the optimization of long-term reward using a ranking objective, as done in Lin & Zhou (2020), and the derivation below illustrates this. Let us note that we apply the ranking regularization on offline and off-policy data, while the formalism below only holds when the behavior policy and the target policy are equivalent, i.e. when the transitions come from on-policy data. If the ranking regularizer is used on on-policy data it approximates the policy gradient, but it does not on off-policy data. Our construction is based on direct policy differentiation (Peters & Schaal, 2008; Williams, 1992), where the objective is θ* = arg max_θ J(θ), and a trajectory is a series of state-action pairs from t = 1, ..., T, i.e. τ = (s_1, a_1, s_2, a_2, ..., s_T). The gradient in Eq. (19) is exactly the gradient of the ranking regularizer.
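The softplus-based pairwise loss and its hinge approximation mentioned above can be compared directly. This is a schematic sketch with illustrative names, not the paper's implementation:

```python
import math

def softplus_ranking_loss(q_t, q_others):
    """-log sigm(q_t - q_j) = softplus(q_j - q_t), summed over j != t."""
    return sum(math.log1p(math.exp(q_j - q_t)) for q_j in q_others)

def hinge_ranking_loss(q_t, q_others, nu=0.0):
    """Squared-hinge approximation with margin nu, as in Eq. (13)."""
    return sum(max(q_j - q_t + nu, 0.0) ** 2 for q_j in q_others)
```

Both losses shrink as the dataset action's value q_t rises above the other actions' values; the hinge variant becomes exactly zero once q_t clears every competitor by the margin ν, while the softplus version decays smoothly but never vanishes.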



Footnotes:
1. Our proposed approach does not depend on π_B being a coherent policy.
2. A one-hot representation is the typical representation for actions in discrete spaces.
3. The fact that no gradient gets propagated through the first layer is only important if we attempt to consider the case when the first-layer weights are learnable.



Figure 2: Two types of extrapolation error. Type A is most dangerous for offline RL, due to the max operation. Type B is difficult to address without additional interactions with the environment. Here, we aim to address Type A extrapolation errors.

Figure 3: Bsuite Experiments: bsuite experimental results on two environments with respect to different levels of noise injected into the actions in the dataset. The proposed method, BRr, outperforms all the baselines on cartpole. Methods implementing a form of behavior constraining (BCQ, CQL and our methods BRr and QRr) excel on catch, stressing its importance.

Figure 6: Overestimation of Q-values with subsampled Atari datasets (% of Dataset). DDQN over-estimates the value of states severely whereas Qr and B reduce over-estimation greatly. We report median over-estimation error over online policy selection games on Atari.

Figure 7: DeepMind Lab Results: We compare the performance of different baselines on challenging DeepMind Lab datasets coming from four different DeepMind Lab levels. Our method, BR, consistently performs the best.

Figure 8: Effect of coverage in the dataset: We compare offline RL models while varying the noise level in the environment. Increasing the noise level increases the coverage as well. BC performs well with low noise; however, BR performs significantly better as the noise increases. Let us note that, in all our experiments, R2D2 uses double Q-learning.

Figure 9: Robustness Experiments: (left) We compare DQN and B in terms of their robustness to the reward distribution on Atari online policy selection games. We split the datasets into two bins: transitions coming from episodes whose episodic return is less than the mean episodic return in the dataset ("Episodic Reward < Mean"), and transitions coming from episodes with return higher than the mean ("Episodic Reward > Mean"). B performs better than DQN in both cases. (right) Normalized scores of DQN and B on subsets of data from online policy selection games. B performs comparatively better than DQN. Q-learning suffers more since the coverage of the dataset is reduced by the subsampling, which causes more severe extrapolation error.

Figure 10: The Effect of increasing the ranking regularization on the action gap.

Figure 12: The raw returns obtained by each baseline on the Atari online policy selection games.

Figure 13: The value error computed by evaluating the agent in the environment, with respect to the ground-truth discounted returns. Negative values indicate under-estimation, positive values over-estimation.

Figure 14: The squared value error computed by evaluating the agent in the environment, with respect to the ground-truth discounted returns; we report the mean of the squared errors.

Figure 15: DeepMind Lab Reward Distribution: We show the reward distributions for the DeepMind Lab datasets. The vertical red line indicates the average episodic return in the datasets.

E[max_a Q(s, a)] ≈ E[P(Q(s, a_t) > Q(s, a_j)) Q(s, a_t) | t ∈ Max] + E[P(Q(s, a_t) ≯ Q(s, a_j)) Q(s, a_j) | j ∈ Max]
= E[P(Q(s, a_t) > Q(s, a_j)) Q(s, a_t) | t ∈ Max] + E[(1 − P(Q(s, a_t) > Q(s, a_j))) Q(s, a_j) | j ∈ Max]
= α E[P(Q̂(s, a_t) > Q̂(s, a_j)) Q̂(s, a_t) | t ∈ Max] + α E[(1 − P(Q̂(s, a_t) > Q̂(s, a_j))) Q̂(s, a_j) | j ∈ Max]
= α (E[P(Q̂(s, a_t) > Q̂(s, a_j)) Q̂(s, a_t) | t ∈ Max] + E[(1 − P(Q̂(s, a_t) > Q̂(s, a_j))) ξ])

P_tj = sigm(Q̂_θ(s, a_t) − Q̂_θ(s, a_j)),  so that  −Σ_j log P_tj = Σ_{i=0}^{|A|} softplus(Q̂_θ(s, a_j) − Q̂_θ(s, a_t))

C(θ) = Σ_{i=0, i≠t}^{|A|} max(Q̂_θ(s, a_i) − Q̂_θ(s, a_t) + ν, 0)²    (13)

Imposing the constraint in Equation (13) can be harmful if the dataset has lots of suboptimal trajectories, because this constraint will try to maximize the values of suboptimal actions in the dataset. As a result, similar to Wang et al. (2020), we propose a filtering function to impose the constraint only on rewarding transitions:

C(θ) = exp(G_B(s) − E_{s∼D}[G_B(s)]) Σ_{i=0, i≠t}^{|A|} max(Q̂_θ(s, a_i) − Q̂_θ(s, a_t) + ν, 0)²
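The filtered constraint can be sketched as follows; the function and variable names are illustrative, not the paper's code:

```python
import math

def filtered_ranking_loss(q_values, t, episodic_return, mean_return, nu=0.05):
    """Squared-hinge ranking penalty weighted by exp(G_B(s) - E[G_B(s)]).

    q_values: Q-values for all actions in state s; t: index of the dataset
    action; the exp(.) weight focuses the penalty on rewarding transitions.
    """
    weight = math.exp(episodic_return - mean_return)
    penalty = sum(max(q_values[i] - q_values[t] + nu, 0.0) ** 2
                  for i in range(len(q_values)) if i != t)
    return weight * penalty
```

Transitions from episodes with above-average return thus receive an exponentially larger share of the ranking penalty, while poorly performing episodes contribute little.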

∇_θ J(θ) = ∇_θ Σ_τ p_θ(τ) G_B(s)    (15)
= Σ_τ p_θ(τ) ∇_θ log p_θ(τ) G_B(s)
= Σ_τ p_θ(τ) ∇_θ [log p(s_0) Π_{t=1}^T π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)] G_B(s)    (16)
= E_{τ∼π_θ}[Σ_{t=1}^T ∇_θ log π_θ(a_t|s_t) G_B(s)]
≈ E_{τ∼π_θ}[rectifier(Q(s, a_i) − Q(s, a_j)) G_B(s)]    (17)

With the baseline E_{s∼D}[G_B(s)] this becomes

E_{τ∼π_θ}[rectifier(Q(s, a_i) − Q(s, a_j)) (G_B(s) − E_{s∼D}[G_B(s)])]    (18)

Then we apply the exp(·) transformation to (G_B(s) − E_{s∼D}[G_B(s)]) to impose this loss mostly on the rewarding trajectories, and we turn the maximization problem into a minimization one with a flip of sign:

−E_{τ∼π_θ}[rectifier(Q(s, a_i) − Q(s, a_j)) exp(G_B(s) − E_{s∼D}[G_B(s)])]    (19)

Atari results: We compare our proposed QRr against other recent state-of-the-art offline RL methods on the Atari offline policy selection games from the RL Unplugged benchmark.

It also shows the robustness of QRr's hyperparameters to different tasks.

Acronyms for our models and their expansions

BSuite dataset details.

DeepMind Lab dataset details. For training data, reward is measured as the maximum over training of the average reward over runs for the same task. For snapshot data, reward is just an average over all episodes recorded using the same level of noise.

bsuite experiments' hyperparameters. The top section of the table corresponds to the shared hyperparameters of the offline RL methods and the bottom section of the table contrasts the hyperparameters of Online vs Offline DQN.

Atari experiments' hyperparameters. The top section of the table corresponds to the shared hyperparameters of the offline RL methods and the bottom section of the table contrasts the hyperparameters of Online vs Offline DQN.

DeepMind Lab experiments' hyperparameters. The top section of the table corresponds to the shared hyperparameters of the offline RL methods and the bottom section of the table contrasts the hyperparameters of Online vs Offline DQN.

We show the performance of our baselines on different Atari offline policy selection games. QRr outperforms the other approaches significantly.

Atari Offline Policy Selection Results: In this table, we list the median normalized performance of different baselines.

We show the results on the DeepMind Lab datasets. It is possible to see from these numerical results that BR outperforms the other approaches, and that B is still very competitive.

Detailed Results on DeepMind Lab: We provide the detailed results for each DeepMind Lab level, along with standard deviations.

BSuite mean results.

For the reparameterization in our experiments, we used four times larger minibatches to update the scale, since it is cheap to update a single scalar, and, as shown in Algorithm 1, we used a learning rate twice smaller for updating the scale than for the rest of the network parameters. This is a heuristic, but we found it to work well across all the tasks that we tried. Potentially, it is possible to get better results by tuning the hyperparameters for the reparameterization more carefully.

Algorithm 1: Reparametrized Q-Network
Inputs: dataset of trajectories D, batch size B_1 to update θ, batch size B_2 to update α, and number of actors A.
Initialize Q-network weights θ. Initialize α to 1. Initialize target network weights θ' ← θ.
for n steps do
  Sample transition sequences (s_{t:t+m}, a_{t:t+m}, r_{t:t+m}) from dataset D to construct a mini-batch of size B_1.
  Calculate the loss L(s_t, a_t, r_t, s_{t+1}; θ, α) using the target network.
  Update θ with gradient descent: θ ← θ − η_1 ∇_θ L(θ).
  Update α with gradient descent: α ← α − η_2 (B_1/B_2) ∇_α L(α).
  If t mod t_target = 0, update the target weights and scale: θ' ← θ, α' ← α.
end for

