DECORRELATED DOUBLE Q-LEARNING

Abstract

Q-learning with value function approximation may perform poorly because of overestimation bias and imprecise estimates. Specifically, overestimation bias arises from the maximum operator over noisy estimates, and is exaggerated further when the estimate of a subsequent state is bootstrapped. Inspired by recent advances in deep reinforcement learning and by Double Q-learning, we introduce decorrelated double Q-learning (D2Q). Specifically, we introduce a Q-value function built on control variates and a decorrelation regularizer that reduces the correlation between the two value function approximators, which leads to less biased estimation and lower variance. Experimental results on a suite of MuJoCo continuous control tasks demonstrate that decorrelated double Q-learning effectively improves performance.

1. INTRODUCTION

Q-learning Watkins & Dayan (1992), a model-free reinforcement learning approach, has gained popularity, especially with the advance of deep neural networks Mnih et al. (2013). In general, it combines neural network approximators with actor-critic architectures Witten (1977); Konda & Tsitsiklis (1999), in which an actor network controls how the agent behaves and a critic evaluates how good the taken action is. The Deep Q-Network (DQN) algorithm Mnih et al. (2013) first applied a deep neural network to approximate the action-value function in Q-learning and showed remarkably good and stable results by introducing a target network and an experience replay buffer to stabilize training. DDPG Lillicrap et al. (2015) extends Q-learning to continuous action spaces using target networks. Besides training stability, another issue Q-learning suffers from is overestimation bias, first investigated in Thrun & Schwartz (1993). Because of noise in the function approximation, the maximum operator in Q-learning can lead to overestimation of state-action values; the same property has been observed in deterministic continuous policy control Silver & Lever (2014). In particular, with imprecise function approximation, maximizing over a noisy value induces overestimation in the action-value function. This inaccuracy can become even worse (e.g., through error accumulation) under temporal difference learning Sutton & Barto (1998), in which bootstrapping updates the value function using the estimate of a subsequent state. Given the overestimation bias caused by the maximum operator over noisy estimates, many methods have been proposed to address this issue. Double Q-learning van Hasselt (2010) mitigates the overestimation effect by introducing two independent critics to estimate the maximum of a set of stochastic values. Averaged-DQN Anschel et al.
(2017) takes the average of previously learned Q-value estimates, which results in a more stable training procedure and reduces the variance of the approximation error in the target values. Recently, Twin Delayed Deep Deterministic Policy Gradients (TD3) Fujimoto et al. (2018) extends Double Q-learning by using the minimum of two critics to limit overestimation bias in the actor-critic setting. Soft actor-critic Haarnoja et al. (2018) leverages a similar strategy to TD3, while including a maximum-entropy term to balance exploration and exploitation. Maxmin Q-learning Lan et al. (2020) proposes an ensembling scheme to handle overestimation bias in Q-learning. This work suggests an alternative solution to the overestimation phenomenon, called decorrelated double Q-learning, based on reducing the noise in the Q-value estimates. On the one hand, we want to make the two value function approximators as independent as possible to mitigate overestimation bias. On the other hand, we want to reduce the variance caused by imprecise estimates. Our decorrelated double Q-learning introduces an objective that minimizes the correlation between the two critics, while reducing the variance of the target approximation error with control variate methods. Finally, we provide experimental results on MuJoCo games and show significant improvement compared to competitive baselines.

The paper is organized as follows. In Section 2, we introduce the reinforcement learning problem, notation, and two existing Q-learning variants that address overestimation bias. We then present our D2Q algorithm in Section 3 and prove that, in the limit, it converges to the optimal solution. In Section 4 we show experimental results on MuJoCo continuous control tasks and compare against the current state of the art. Related work and discussion are presented in Section 5, and Section 6 concludes the paper.

2. BACKGROUND

In this section, we introduce the reinforcement learning problem and Q-learning, as well as the notation that will be used in the following sections.

2.1. PROBLEM SETTING AND NOTATIONS

We consider the model-free reinforcement learning problem (i.e., an optimal policy is assumed to exist) with sequential interactions between an agent and its environment Sutton & Barto (1998), with the goal of maximizing a cumulative return. At every time step $t$, the agent selects an action $a_t$ in state $s_t$ according to its policy, receives a scalar reward $r_t(s_t, a_t)$, and transitions to the next state $s_{t+1}$. The problem is modeled as a Markov decision process (MDP) with tuple $(S, A, p(s_0), p(s_{t+1}|s_t, a_t), r(s_t, a_t), \gamma)$. Here, $S$ and $A$ denote the state and action spaces respectively, and $p(s_0)$ is the initial state distribution. $p(s_{t+1}|s_t, a_t)$ is the transition probability to $s_{t+1}$ given the current state $s_t$ and action $a_t$, $r(s_t, a_t)$ is the reward from the environment after the agent takes action $a_t$ in state $s_t$, and $\gamma$ is the discount factor, which decays future rewards to ensure finite returns. We model the agent's behavior with $\pi_\theta(a|s)$, a parametric distribution given by a neural network. Suppose we have a finite-length trajectory $\tau = (s_t, a_t)_{t=0}^{T}$ collected while the agent interacts with the environment. The expected return under the policy $\pi_\theta$ is

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R_0^T] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big],$$

where $\pi_\theta(\tau)$ denotes the distribution over trajectories,

$$p(\tau) = \pi(s_0, a_0, s_1, \ldots, s_T, a_T) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t).$$

The goal of reinforcement learning is to learn a policy $\pi$ that maximizes the expected return:

$$\theta^* = \arg\max_\theta J(\theta) = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R_0^T].$$

The action-value function describes the expected return of the agent in state $s$ taking action $a$ under policy $\pi$. The advantage of the action-value function is that it makes actions explicit, so we can select actions even in the model-free setting.
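The discounted return above can be computed directly from a trajectory's rewards. A minimal sketch (the function name is ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_0^T = sum_{t=0}^T gamma^t * r_t for one finite trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A length-3 trajectory with rewards (1, 1, 1) and gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # -> 1.75
```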
After taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$, the action-value function is

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[R_t \mid s_t, a_t] = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}\Big[\sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i) \,\Big|\, s_t, a_t\Big].$$

The optimal value function is obtained by maximizing over policies, $Q^*(s_t, a_t) = \max_\pi Q^\pi(s_t, a_t)$, and the corresponding optimal policy $\pi^*$ is easily derived by $\pi^*(s) \in \arg\max_{a_t} Q^*(s_t, a_t)$.

2.2. Q-LEARNING

Q-learning, an off-policy RL algorithm, has been extensively studied since it was proposed Watkins & Dayan (1992). Suppose we use a neural network parametrized by $\theta^Q$ to approximate the Q-value in a continuous environment. To update the Q-value function, we minimize the following loss:

$$L(\theta^Q) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[(Q(s_t, a_t; \theta^Q) - y_t)^2], \quad (5)$$

where $y_t = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^Q)$ follows from the Bellman equation, and the action $a_{t+1}$ is taken from a frozen policy network (actor) to stabilize learning. In actor-critic methods, the policy $\pi: S \to A$, known as the actor with parameters $\theta^\pi$, can be updated through the chain rule as in the deterministic policy gradient algorithm Silver & Lever (2014):

$$\nabla J(\theta^\pi) = \mathbb{E}_{s \sim p_\pi}[\nabla_a Q(s, a; \theta^Q)|_{a = \pi(s; \theta^\pi)} \nabla_{\theta^\pi} \pi(s; \theta^\pi)], \quad (6)$$

where $Q(s, a)$ is the expected return when taking action $a$ in state $s$ and following $\pi$ thereafter. One issue that has attracted great attention is overestimation bias, which, if left unchecked, may compound into a more significant bias over subsequent updates. Moreover, an inaccurate value estimate may lead to poor policy updates. To address this, Double Q-learning van Hasselt (2010) uses two independent critics $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$, where action selection uses a different critic network than value estimation:

$$q_1(s_t, a_t) = r(s_t, a_t) + \gamma q_2(s_{t+1}, \arg\max_{a_{t+1}} q_1(s_{t+1}, a_{t+1}; \theta^{q_1}); \theta^{q_2}),$$
$$q_2(s_t, a_t) = r(s_t, a_t) + \gamma q_1(s_{t+1}, \arg\max_{a_{t+1}} q_2(s_{t+1}, a_{t+1}; \theta^{q_2}); \theta^{q_1}).$$

Recently, TD3 Fujimoto et al. (2018) uses a similar pair of Q-value functions, but takes the minimum of the two:

$$y_t = r(s_t, a_t) + \gamma \min\big(q_1(s_{t+1}, \pi(s_{t+1})), q_2(s_{t+1}, \pi(s_{t+1}))\big).$$

The same squared loss in Eq. 5 can then be used to learn the model parameters.
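The clipped target used by TD3 above can be sketched as a one-line computation; the `done` termination flag and function name are our additions for illustration:

```python
import numpy as np

def clipped_double_q_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """TD3-style target: y = r + gamma * min(q1(s', a'), q2(s', a')).

    q1_next and q2_next are the two target critics' estimates at
    (s_{t+1}, a_{t+1}); taking the elementwise minimum limits
    overestimation bias. `done` zeroes the bootstrap at terminal states.
    """
    return r + gamma * (1.0 - float(done)) * np.minimum(q1_next, q2_next)

y = clipped_double_q_target(r=1.0, q1_next=10.0, q2_next=8.0, gamma=0.5)
print(y)  # -> 5.0, i.e. 1.0 + 0.5 * min(10, 8)
```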

3. DECORRELATED DOUBLE Q-LEARNING

In this section, we present Decorrelated Double Q-learning (D2Q) for continuous action control, in an attempt to address overestimation bias. Similar to Double Q-learning, we use two Q-value functions to approximate $Q(s_t, a_t)$. Our main contribution is to borrow the idea of control variates to decorrelate these two value functions, which can further reduce the overestimation risk.

3.1. Q-VALUE FUNCTION

Suppose we have two approximators $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$. D2Q uses a weighted difference of the two Q-value functions to approximate the action-value function at $(s_t, a_t)$:

$$Q(s_t, a_t) = q_1(s_t, a_t) - \beta\big(q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))\big), \quad (8)$$

where $q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))$ models the noise at state $s_t$ and action $a_t$, and $\beta$ is the correlation coefficient of $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$. The expectation $\mathbb{E}(q_2(s_t, a_t))$ is the average over all possible runs. Thus, the weighted difference between $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$ attempts to reduce the variance and remove the noise effects in Q-learning. To update $q_1$ and $q_2$, we minimize the following loss:

$$L(\theta^Q) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[(q_1(s_t, a_t; \theta^{q_1}) - y_t)^2] + \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[(q_2(s_t, a_t; \theta^{q_2}) - y_t)^2] + \lambda\, \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[\mathrm{corr}(q_1(s_t, a_t; \theta^{q_1}), q_2(s_t, a_t; \theta^{q_2}))]^2, \quad (9)$$

where $\theta^Q = \{\theta^{q_1}, \theta^{q_2}\}$, and $y_t$ can be defined as

$$y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1}), \quad (10)$$

where $Q(s_{t+1}, a_{t+1})$ is the action-value function defined in Eq. 8, which decorrelates $q_1(s_{t+1}, a_{t+1})$ and $q_2(s_{t+1}, a_{t+1})$, both taken from the frozen target networks. In addition, we want these two Q-value functions to be as independent as possible. Thus, we introduce $\mathrm{corr}(q_1(s_t, a_t; \theta^{q_1}), q_2(s_t, a_t; \theta^{q_2}))$, which measures the similarity between the two Q-value approximators. In our experiments, the method using Eq. 10 obtains good results on HalfCheetah, but does not perform well on the other MuJoCo tasks.
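The control-variate estimate of Eq. 8 and the loss of Eq. 9 reduce to simple scalar computations per sample; a minimal sketch (function names are ours, and the expectation over the batch is left out for clarity):

```python
def d2q_value(q1, q2, q2_mean, beta):
    """Eq. 8: Q(s,a) = q1(s,a) - beta * (q2(s,a) - E[q2(s,a)]).

    The centered q2 term acts as a control variate for the noise in q1.
    """
    return q1 - beta * (q2 - q2_mean)

def d2q_loss(q1, q2, y, corr, lam=2.0):
    """Eq. 9 for a single sample: two squared TD errors plus the
    squared correlation penalty weighted by lambda."""
    return (q1 - y) ** 2 + (q2 - y) ** 2 + lam * corr ** 2

# When q2 carries no noise (it equals its mean), Q reduces to q1:
print(d2q_value(q1=3.0, q2=2.0, q2_mean=2.0, beta=0.5))  # -> 3.0
```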
To stabilize the target value, we take the minimum of $Q(s_{t+1}, a_{t+1})$ and $q_2(s_{t+1}, a_{t+1})$ in Eq. 10, as in TD3 Fujimoto et al. (2018). This gives the target update of the D2Q algorithm:

$$y_t = r(s_t, a_t) + \gamma \min\big(Q(s_{t+1}, a_{t+1}), q_2(s_{t+1}, a_{t+1})\big). \quad (11)$$

The action $a_{t+1}$ is given by the policy, $a_{t+1} = \pi(s_{t+1}; \theta^\pi)$, which is trained with a policy gradient similar to Eq. 6. D2Q follows the parametric actor-critic paradigm, maintaining two Q-value approximators and a single actor. The loss in Eq. 9 drives the three terms

$$\mathrm{corr}(q_1(s_t, a_t; \theta^{q_1}), q_2(s_t, a_t; \theta^{q_2})) \to 0, \quad q_1(s_t, a_t; \theta^{q_1}) \to y_t, \quad q_2(s_t, a_t; \theta^{q_2}) \to y_t.$$

At each time step, we update the pair of critics towards the minimum target value in Eq. 11, while reducing the correlation between them. The purposes of introducing the control variate $q_2(s_t, a_t)$ are the following. (1) Since we use $q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))$ to model noise, if there is no noise, so that $q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t)) = 0$, then via Eq. 11 we have

$$y_t = r(s_t, a_t) + \gamma\min\big(Q(s_{t+1}, a_{t+1}), q_2(s_{t+1}, a_{t+1})\big) = r(s_t, a_t) + \gamma\min\big(q_1(s_{t+1}, a_{t+1}), q_2(s_{t+1}, a_{t+1})\big),$$

which is exactly the TD3 target. (2) In practice, because of noise in the value estimate, $q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t)) \neq 0$. The purpose of introducing $q_2(s_t, a_t)$ is to mitigate overestimation bias in Q-learning: the control variate reduces the variance of $Q(s_t, a_t)$ and stabilizes the learning of the value function.

Convergence analysis: we claim that the D2Q algorithm converges to the optimum in the finite MDP setting. We use an existing theorem from Jaakkola et al. (1994): given a random process $\{\Delta_t\}$ taking values in $\mathbb{R}^n$ and defined as

$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)F_t(s_t, a_t),$$

$\Delta_t$ converges to zero with probability 1 under the following assumptions:

1. $0 < \alpha_t < 1$, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t^2(x) < \infty$;
2. $\|\mathbb{E}[F_t(x)|\mathcal{F}_t]\|_W \le \gamma \|\Delta_t\|_W + c_t$, with $0 < \gamma < 1$ and $c_t \to 0$ with probability 1;
3. $\mathrm{var}[F_t(x)|\mathcal{F}_t] \le C(1 + \|\Delta_t\|_W^2)$ for some $C > 0$,

where $\mathcal{F}_t$ is a sequence of increasing $\sigma$-fields such that $\alpha_t(s_t, a_t)$ and $\Delta_t$ are $\mathcal{F}_t$-measurable for $t = 1, 2, \ldots$. Based on this theorem, we provide a sketch of a proof that borrows heavily from the convergence proofs of Double Q-learning and TD3. First, the learning rate $\alpha_t$ satisfies condition 1. Second, the variance of $r(s_t, a_t)$ is bounded, so condition 3 holds. Finally, we show that condition 2 holds. We have

$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))(Q(s_t, a_t) - Q^*(s_t, a_t)) + \alpha_t(s_t, a_t)\big(r_t + \gamma \min(Q(s_t, a_t), q_2(s_t, a_t)) - Q^*(s_t, a_t)\big) = (1 - \alpha_t(s_t, a_t))\Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)F_t(s_t, a_t),$$

where $F_t(s_t, a_t)$ is defined as:

$$F_t(s_t, a_t) = r_t + \gamma \min(Q(s_t, a_t), q_2(s_t, a_t)) - Q^*(s_t, a_t) = r_t + \gamma Q(s_t, a_t) - Q^*(s_t, a_t) + \gamma\big(\min(Q(s_t, a_t), q_2(s_t, a_t)) - Q(s_t, a_t)\big) = F_t^Q(s_t, a_t) + c_t.$$

Since $\mathbb{E}[F_t^Q(s_t, a_t)|\mathcal{F}_t] \le \gamma\|\Delta_t\|$ under standard Q-learning, condition 2 holds provided that $c_t = \gamma\big(\min(Q(s_t, a_t), q_2(s_t, a_t)) - Q(s_t, a_t)\big)$ converges to 0 with probability 1. Observe that

$$\min(Q(s_t, a_t), q_2(s_t, a_t)) - Q(s_t, a_t) = \min\big(Q(s_t, a_t) - q_2(s_t, a_t), 0\big) - \big(Q(s_t, a_t) - q_2(s_t, a_t)\big),$$

where, by Eq. 8, $Q(s_t, a_t) - q_2(s_t, a_t) = q_1(s_t, a_t) - q_2(s_t, a_t) - \beta\big(q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))\big)$. Suppose there exist small $\delta_1$ and $\delta_2$ such that $|q_1(s_t, a_t) - q_2(s_t, a_t)| \le \delta_1$ and $|q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))| \le \delta_2$. Then, since $|\min(x, 0) - x| \le 2|x|$,

$$\big|\min(Q(s_t, a_t), q_2(s_t, a_t)) - Q(s_t, a_t)\big| \le 2\big(|q_1(s_t, a_t) - q_2(s_t, a_t)| + \beta |q_2(s_t, a_t) - \mathbb{E}(q_2(s_t, a_t))|\big) = 2(\delta_1 + \beta\delta_2) < 4\delta,$$

where $\delta = \max(\delta_1, \delta_2)$ and we use $\beta < 1$. Note that such a $\delta_1$ with $|q_1(s_t, a_t) - q_2(s_t, a_t)| \le \delta_1$ exists because $\Delta_t(q_1, q_2) = |q_1(s_t, a_t) - q_2(s_t, a_t)|$ converges to zero: according to Eq. 9, both $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$ are updated as

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) + \alpha_t(s_t, a_t)(y_t - q_t(s_t, a_t)),$$

so

$$\Delta_{t+1}(q_1, q_2) = \Delta_t(q_1, q_2) - \alpha_t(s_t, a_t)\Delta_t(q_1, q_2) = (1 - \alpha_t(s_t, a_t))\Delta_t(q_1, q_2),$$

which converges to 0 since the learning rate satisfies $0 < \alpha_t(s_t, a_t) < 1$.

3.2. CORRELATION COEFFICIENT

We introduce $\mathrm{corr}(q_1(s_t, a_t), q_2(s_t, a_t))$ in Eq. 9 to reduce the correlation between the two value approximators $q_1$ and $q_2$; in other words, we want $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$ to be as independent as possible. In this paper, we define

$$\mathrm{corr}(q_1(s_t, a_t), q_2(s_t, a_t)) = \mathrm{cosine}\big(f_{q_1}(s_t, a_t), f_{q_2}(s_t, a_t)\big),$$

where $\mathrm{cosine}(a, b)$ is the cosine similarity between two vectors $a$ and $b$, and $f_q(s_t, a_t)$ is the vector representation of the last hidden layer in the value approximator $q(s_t, a_t)$. In other words, we constrain the hidden representations learned by $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$ in the loss function, in an attempt to make them independent. According to control variate theory, the optimal $\beta$ in Eq. 8 is

$$\beta = \frac{\mathrm{cov}\big(q_1(s_t, a_t), q_2(s_t, a_t)\big)}{\mathrm{var}\big(q_2(s_t, a_t)\big)},$$

where $\mathrm{cov}$ denotes covariance and $\mathrm{var}$ denotes variance. Since these moments are difficult to estimate in a continuous action space, we take an approximation; to also reduce the number of hyperparameters, we set $\beta = \mathrm{corr}(q_1(s_t, a_t), q_2(s_t, a_t))$ in Eq. 8 as a proxy for the correlation coefficient of $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$.
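The cosine-similarity regularizer over the last-hidden-layer features can be sketched as follows; the function name and the small `eps` guard against zero norms are our additions:

```python
import numpy as np

def corr_penalty(f1, f2, eps=1e-8):
    """Cosine similarity between the last-hidden-layer feature vectors of
    the two critics. It serves both as the regularizer corr(q1, q2) in
    Eq. 9 and as the approximation of beta in Eq. 8."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps))

# Orthogonal features -> zero penalty (the critics look "independent"):
print(corr_penalty([1.0, 0.0], [0.0, 1.0]))  # -> 0.0
```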

3.3. ALGORITHM

We summarize our approach in Algorithm 1. Similar to Double Q-learning, we use target networks with a slow update rate to maintain stability under temporal difference learning. Our contributions are twofold: (1) we introduce a loss that minimizes the correlation between the two critics, which makes $q_1(s_t, a_t)$ and $q_2(s_t, a_t)$ as independent as possible and thus effectively reduces the overestimation risk; (2) we add control variates to reduce variance in the learning procedure.
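To make the update concrete, here is a tabular sketch of a single D2Q step combining Eq. 8 and the clipped target of Eq. 11. This is our simplification for illustration: the paper uses neural critics trained by gradient descent on Eq. 9, and we approximate $\mathbb{E}(q_2)$ by the table mean:

```python
import numpy as np

def d2q_tabular_step(q1, q2, s, a, r, s_next, a_next,
                     alpha=0.1, gamma=0.99, beta=0.5):
    """One tabular D2Q update (illustrative sketch).

    Q(s,a) = q1 - beta * (q2 - mean(q2)) is the control-variate estimate
    (Eq. 8); the target takes min(Q, q2) as in Eq. 11, and both tables
    move toward the shared target y.
    """
    Q_next = q1[s_next, a_next] - beta * (q2[s_next, a_next] - q2.mean())
    y = r + gamma * min(Q_next, q2[s_next, a_next])
    q1[s, a] += alpha * (y - q1[s, a])
    q2[s, a] += alpha * (y - q2[s, a])
    return q1, q2

q1 = np.zeros((2, 2))
q2 = np.zeros((2, 2))
q1, q2 = d2q_tabular_step(q1, q2, s=0, a=0, r=1.0, s_next=1, a_next=1)
print(q1[0, 0])  # -> 0.1, i.e. alpha * (1.0 + gamma * min(0, 0))
```

The correlation penalty has no tabular analogue here, since it acts on hidden-layer features of the neural critics.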

4. EXPERIMENTAL RESULTS

In this section, we evaluate our method on the suite of MuJoCo continuous control tasks. We use the OpenAI Gym environments with the MuJoCo v2 versions of all tasks. The quantitative results over 5 trials are presented in Table 1. Compared to SAC Haarnoja et al. (2018), our approach shows better performance with lower variance given the same number of training samples, and it yields competitive results compared to TD3 and DDPG. Specifically, our D2Q method outperforms all other algorithms with much lower variance on Ant, HalfCheetah, InvertedDoublePendulum, and Walker2d. On the Hopper task, our method achieves a maximum reward competitive with the best methods such as TD3, with comparable variance.

5. RELATED WORK

Q-learning can suffer from overestimation bias because it uses a maximum over noisy estimates to estimate the maximum expected value. To address this overestimation issue Thrun & Schwartz (1993), many approaches have been proposed that avoid the maximization operator over a noisy value estimate. Delayed Q-learning Strehl et al. (2006) tries to find an ε-optimal policy by controlling how frequently the state-action function is updated; it is guaranteed to converge in polynomial time but can still suffer from overestimation bias. Double Q-learning van Hasselt (2010) introduces two independent critics to mitigate the overestimation effect. Another side effect of consistent overestimation Thrun & Schwartz (1993) in Q-learning is that the accumulated temporal difference error Sutton & Barto (1998) can cause high variance. To reduce this variance, there are two popular families of approaches: baselines and actor-critic methods Witten (1977); Konda & Tsitsiklis (1999). In policy gradient methods, subtracting a baseline from the Q-value reduces variance without introducing bias. Further, the advantage actor-critic (A2C) Mnih et al. (2016) introduces an average value for each state and leverages the difference between the value function and this average to update the policy parameters. Schulman et al. proposed generalized advantage estimation Schulman et al. (2016), which considers the whole episode with an exponentially weighted estimator of the advantage function, analogous to TD(λ), to substantially reduce the variance of policy gradient estimates at the cost of some bias. From another point of view, baseline and actor-critic methods can be categorized as control variate methods Greensmith et al. (2001). In this paper, we present a novel variant of Double Q-learning to constrain possible overestimation: we limit the correlation between the pair of Q-value functions, and we introduce control variates to reduce variance and improve performance.

6. CONCLUSION

In this paper, we propose the Decorrelated Double Q-learning approach for off-policy value-based reinforcement learning. We use a pair of critics for value estimation and introduce a regularization term into the loss function to decorrelate the two approximators; minimizing this loss constrains the two Q-value functions to be as independent as possible. In addition, considering the overestimation that derives from the maximum operator over positive noise, we leverage control variates to reduce variance and stabilize the learning procedure. Experimental results on a suite of challenging continuous control tasks demonstrate that our approach yields performance on par with or better than competitive baselines. Although we leverage control variates in our Q-value function, we approximate the correlation coefficient with a simple strategy based on the similarity of the two Q-functions; in future work, we will consider better estimates of the correlation coefficient in the control variate method.

A APPENDIX

In this appendix, we add experiments on how our model performs when varying λ. We set λ ∈ {1, 2, 5, 10}, run 1 million steps for each setting, and evaluate performance every 5000 steps, keeping all other parameters the same.

Figure 4: Performance of our method while adjusting the parameter λ. The shaded region represents the standard deviation of the average evaluation over nearby windows of size 10.



Figure 1: The learning curves with exploration noise on Reacher and Humanoid environments. The shaded region represents the standard deviation of the average evaluation over nearby windows with size 10. On the MuJoCo tasks, our D2Q algorithm yields competitive results, compared to TD3 and DDPG.

Table 1: Comparison of Max Average Return over 5 trials of 1 million samples. The maximum value is marked in bold for each task; ± corresponds to a single standard deviation over trials.

Averaged-DQN Anschel et al. (2017) takes the average of previously learned Q-value estimates, which results in a more stable training procedure and reduces the variance of the approximation error in the target values. A clipped Double Q-learning, called TD3 Fujimoto et al. (2018), extends the deterministic policy gradient Silver & Lever (2014); Lillicrap et al. (2015) to address overestimation bias; in particular, TD3 uses the minimum of two independent critics to approximate the value function. Soft actor-critic Haarnoja et al. (2018) takes a similar approach to TD3, but with better exploration through the maximum-entropy method. Maxmin Q-learning Lan et al. (2020) extends Double Q-learning and TD3 to multiple critics to handle overestimation bias and variance.

Greensmith et al. (2001) theoretically analyze two additive control variate methods, the baseline and the actor-critic method, for reducing the variance of performance gradient estimates in reinforcement learning. Interpolated policy gradient (IPG) Gu et al. (2017), based on control variate methods, merges on- and off-policy updates to reduce variance in deep reinforcement learning. Motivated by Stein's identity, Liu et al. (2018) introduce more flexible and general action-dependent baseline functions by extending the control variate methods previously used in REINFORCE and advantage actor-critic.


Algorithm 1 Decorrelated Double Q-learning
Initialize a pair of critic networks q1(s, a; θ^q1), q2(s, a; θ^q2) and the actor π(s; θ^π) with weights θ^Q = {θ^q1, θ^q2} and θ^π
Initialize the corresponding target networks for both critics and the actor
Initialize the total number of episodes N, the batch size, and the replay buffer R
Initialize the coefficient λ in Eq. 9 and the update rate τ for the target networks
for episode = 1 to N do
    Receive the initial observation state s0 from the environment
    for t = 0 to T do
        Select the action a_t = π(s_t; θ^π) + ε, with ε ∼ N(0, σ)

We compared our approach against state-of-the-art off-policy continuous control algorithms, including DDPG, SAC and TD3. Since SAC requires well-tuned hyperparameters to reach its maximum reward across different tasks, we used the existing results from the training logs published by its authors. To obtain consistent results, we used the authors' implementations of TD3 and DDPG. In practice, while minimizing the loss in Eq. 9, we constrain β ∈ (0, 1). In addition, we add Gaussian noise to the action selected by the target policy in Eq. 11; specifically, the target policy adds noise as a_{t+1} = π(s_{t+1}; θ^π) + ε, where ε = clip(N(0, σ), -c, c) with c = 0.5. Unless otherwise specified, we use the same parameters below for all environments. The deep architecture for both actor and critic uses the same networks as TD3 Fujimoto et al. (2018), with hidden layers [400, 300, 300]. Note that the actor adds noise N(0, 0.1) to its action space to enhance exploration, and the critic network has two Q-functions q1(s, a) and q2(s, a). The minibatch size is 100, and both networks' parameters are updated with Adam using learning rate 10^-3. In addition, we use target networks, including the pair of critics and a single actor, to improve performance as in DDPG and TD3.
The target policy is smoothed by adding Gaussian noise N(0, 0.2) as in TD3, and both target networks are updated with τ = 0.005. We set the balance weight λ = 2 for all tasks except Walker2d, for which we set λ = 10. The off-policy algorithm uses a replay buffer R of size 10^6 for all experiments. We run each task for 1 million time steps and evaluate every 5000 time steps with no exploration noise. We repeat each task 5 times with different random seeds and report the mean and standard deviation. We report evaluation results by averaging the returns with a window of size 10. The evaluation curves are shown in Figures 1, 2 and 3. Our D2Q algorithm consistently achieves better performance than TD3 on most continuous control tasks, including the InvertedDoublePendulum, Walker2d, Ant, HalfCheetah and Hopper environments. Other methods such as TD3 perform well on one task, Reacher, but perform poorly on other tasks compared to our algorithm. We also evaluated our approach on a high-dimensional continuous action space task: Humanoid-v2 has a 376-dimensional state space and a 17-dimensional action space. For this task, we set the learning rate to 3 × 10^-4 and compared against DDPG and TD3. The result in Figure 1(b) demonstrates that our performance on this task is on par with TD3.
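The two stabilization tricks described above, the soft (Polyak) target update with τ = 0.005 and the clipped target-policy noise with c = 0.5, can be sketched as follows (function names are ours; parameters are represented as plain arrays for simplicity):

```python
import numpy as np

def polyak_update(target_params, params, tau=0.005):
    """Soft target-network update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

def smoothed_target_action(pi_action, sigma=0.2, c=0.5, rng=None):
    """Target-policy smoothing: a' = pi(s') + eps, eps = clip(N(0, sigma), -c, c)."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(pi_action)), -c, c)
    return np.asarray(pi_action) + noise
```

With τ this small, the target networks trail the online networks slowly, which keeps the bootstrap targets in Eq. 11 from chasing their own updates.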

