DEEP REINFORCEMENT LEARNING ON ADAPTIVE PAIRWISE CRITIC AND ASYMPTOTIC ACTOR

Abstract

Maximum entropy deep reinforcement learning has displayed great potential on a range of challenging continuous tasks. The maximum entropy term encourages policy exploration, but it trades off efficiency against stability, especially on large-scale tasks with high state and action dimensionality. The temperature hyperparameter of the maximum entropy term is sometimes restricted so that training remains stable, at the cost of slower convergence to a lower return. In addition, the function approximation errors in actor-critic learning are known to induce estimation errors and suboptimal policies. In this paper, we propose an algorithm that combines adaptive pairwise critics with an adaptive asymptotic maximum entropy. Specifically, we add a trainable state-dependent weight factor to build an adaptive pairwise target Q-value that serves as the surrogate policy objective. We then adopt a state-dependent adaptive temperature to smooth the entropy-driven policy exploration, which introduces an asymptotic maximum entropy. The adaptive pairwise critics improve the value estimation and counteract both overestimation and underestimation errors. Meanwhile, the adaptive asymptotic entropy adapts to the tradeoff between efficiency and stability, which provides more exploration and flexibility. We evaluate our method on a set of Gym tasks, and the results show that the proposed algorithm performs better than several baselines on continuous control.

1. INTRODUCTION

The task of deep reinforcement learning (DRL) is to learn good policies by optimizing a discounted cumulative reward through function approximation. In DRL, maximizing over noisy Q-value estimates at every update tends to favor inaccurate value approximations that exceed the true value Thrun & Schwartz (1993), i.e., overestimation. This error further accumulates and propagates via the bootstrapping of temporal difference learning Sutton & Barto (2018), which estimates the value function using the value estimate of a subsequent state. When function approximation is unavoidably adopted in the actor-critic setting on continuous control, the estimation errors are exaggerated. These errors may cause suboptimal policies, divergence and instability. To some extent, inaccurate estimation is unavoidable in DRL, because value-based DRL inherently uses random variables as target values. On the one hand, these stochastic target values introduce estimation biases. On the other hand, even an unbiased estimate with high variance can still lead to future overestimation in local regions of the state space, which in turn can negatively affect the global policy Fujimoto et al. (2018). Therefore, reducing the value variance without introducing bias can be an effective means to reduce estimation errors, whether overestimation or underestimation. Taking the twin delayed deep deterministic policy gradient (TD3) Fujimoto et al. (2018) as an example, always selecting the lower value from a pair of critics induces an underestimation bias, although it helps to lower the variance.

Several recent works deal with errors such as the bootstrapping error caused by out-of-distribution (OOD) actions Kumar et al. (2019; 2020), and the extrapolation error induced by the mismatch between the distribution of buffer-sampled data and the true state-action visitation of the current policy Fujimoto et al. (2019). The authors in Wu et al. (2019) address the distribution errors by an extra value penalty or policy regularization. Overestimation is another induced error, which was originally found in the Q-learning algorithm by Watkins (1989), and was demonstrated in the deep Q-network (DQN) Mnih et al. (2015) on discrete control. In recent years, overestimation has been reported in the function approximation of actor-critic methods on continuous control Fujimoto et al. (2018); Duan et al. (2021a). Although several algorithms have been created to address overestimation errors Fujimoto et al. (2018; 2019); Kumar et al. (2019); Wu et al. (2019); Duan et al. (2021a), the accuracy of function approximation is not flexibly handled, since underestimation errors usually accompany the correction of overestimation.

This paper makes the following contributions. First, we propose the concept of adaptive pairwise critics, which connects a pair of critics using a trainable state-dependent weight factor, to combat estimation errors. Second, we propose an adaptive temperature that is also state-dependent, so that the agent can explore freely with only a loose restriction on the selection of the temperature hyperparameter. Based on this adaptive temperature, we construct an asymptotic maximum entropy term to optimize the policy. The asymptotic maximum entropy is combined with the adaptive pairwise critics to form the target Q-value as well as the surrogate policy objective. Third, we present a novel algorithm to tackle estimation errors and pursue effective and stable exploration. Finally, experimental evaluations are conducted to compare the proposed algorithm with several baselines in terms of sample complexity and stability.

2. RELATED WORK

In reinforcement learning, the agent needs to interact with the environment to collect enough knowledge for training. Without sufficient exploration, the collected data may be insufficient to reach an optimal value. Therefore, reinforcement learning has to deal with the tradeoff between exploration and exploitation. There are several ways to enhance exploration in deep reinforcement learning (DRL). One is the off-policy approach, which takes full advantage of past experience from the replay buffer instead of on-policy data Mnih et al. (2015). Another adopts policy exploration to stimulate the agent's motivation for a better balance between exploration and exploitation Mnih et al. (2016); Haarnoja et al. (2018a). Among them, soft actor-critic (SAC) Haarnoja et al. (2018a) achieves good performance on a set of continuous control tasks by adopting stochastic policies and maximum entropy. Compared with their deterministic counterparts, stochastic policies generalize the policy improvement and introduce uncertainty into action decisions Heess et al. (2015), and augmenting the reward return with an entropy maximization term encourages exploration, thus improving robustness and stability Ziebart et al. (2008); Ziebart (2010).

In recent years, many works have been built on top of SAC. SAC can be improved by changing the rule of experience replay. For example, Wang & Ross (2019) sample more aggressively from recent experience while ordering the updates to ensure that updates from old data do not overwrite updates from new data, and Martin et al. (2021) relabel successful episodes as expert demonstrations for the agent to match. The distributional soft actor-critic (DSAC) Duan et al. (2021b); Ren et al. (2020); Ma et al. (2020); Duan et al. (2021c) combines the distributional return function with the maximum entropy to improve the estimation accuracy of the Q-value. It claims to prevent gradient explosion by truncating the difference between the target and current return distributions. However, its assumption of Gaussian distributions for random returns induces more complexity and may not fit the real distributions. Akimov (2019); Hou et al. (2020) reparameterize the reward representation and the policy, respectively, using a neural network transformation composed of multivariate factorization, and Ward et al. (2019) construct normalizing flow policies before applying the squashing function to improve exploration within the SAC framework.

It is empirically shown that SAC is sensitive to the temperature hyperparameter. To provide flexibility in the choice of the optimal temperature, SAC-v2 Haarnoja et al. (2018b) makes the first step to automatically tune the temperature hyperparameter by formulating a constrained optimization problem for the average entropy of the policy. The dual of the constrained optimization adds an update procedure for the dual variable that determines the temperature. However, the convexity assumption behind the theoretical convergence does not hold for neural networks, and the extra hyperparameter introduced by the transformation remains undetermined and needs more trials to generalize. Meta-SAC Wang & Ni (2020) uses a metagradient along with a novel meta objective to automatically tune the entropy temperature in SAC. It distinguishes metaparameters from the learnable parameters and hyperparameters, and uses some initial states to train the meta temperature. However, due to the limited data pool for the meta loss, the reported experimental results show that it does not perform better than SAC. Therefore, the automatic adjustment of the temperature hyperparameter remains an open problem for SAC.

The way to compute the target Q-value is a crucial design choice in DRL. The strategies include delayed updates Van Hasselt et al. (2016), soft updates Lillicrap et al. (2015); Haarnoja et al. (2018b) and sophisticated ensembles Fujimoto et al. (2019); Kumar et al. (2019) of the target Q-value. A sophisticated ensemble is a weighted mixture of the minimum and maximum among multiple learned Q-value functions. For example, TD3 adopts the minimum of pairwise critics, and bootstrapping error reduction (BEAR) Kumar et al. (2019) increases the number of Q-functions to 4. Behavior regularized actor critic (BRAC) Wu et al. (2019) investigated these design choices and concluded that using more than 2 Q-functions gives only marginal improvement while requiring significantly more computation. It is reported in Wu et al. (2019) that the minimum of two Q-functions adopted in TD3 outperforms a weighted mixture of Q-values in terms of simplicity and efficiency; however, there is a wide unexplored area between them. How to design a mixture of Q-values is still largely an open question.

3. BACKGROUND

We consider the infinite-horizon Markov decision process (MDP) in continuous action spaces, denoted by the tuple $(S, A, p, r)$, where $S$ is the state space, $A$ is the action space, $p(\cdot|s, a)$ is the transition probability of the next state $s' \in S$ conditioned on the current state $s \in S$ and action $a \in A$, and $r : S \times A \to \mathbb{R}$ is the reward, which is the feedback from the environment for the current state $s$ and action $a$. The task of reinforcement learning (RL) is to learn an optimal policy that maximizes the reward return, denoted by the expectation of the discounted cumulative reward. DRL combines neural networks with RL so that the reward return can be approximated by a parameterized function, where the agent follows a behavior policy $\pi$ to determine future rewards and next states. Let $p(\cdot|s, a)$ denote the transition probability. Then the surrogate function of the reward return can be selected as the action-value (Q-value) function with respect to the state-action pair, in the form of

$$Q^{\pi}(s, a) = \mathbb{E}_{p_{\pi}(s_t|s_0, a_0)}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\right],$$

where $r(s, a)$ is the immediate reward produced by the state-action pair, and $\gamma \in (0, 1)$ is the discount horizon factor for future rewards. Under the behavior policy $\pi$, $p_{\pi}(s_t|s_0, a_0) = p(s_1|s_0, a_0) \prod_{t} \mathbb{E}_{a_{t-1} \sim \pi}\left[p(s_t|s_{t-1}, a_{t-1})\right]$ is the joint probability of all state-action pairs during an episode given the initial state-action pair $(s_0, a_0)$, and $\pi(a_{t+1}|s_{t+1})$ indicates the probability for the agent to choose the action $a_{t+1}$ given the state $s_{t+1}$.

Since the reward return satisfies the Bellman equation, the temporal difference (TD) Tesauro (1995) is commonly used in the critic evaluation to minimize Bellman errors over sampled transitions $(s, a, r, s')$, which is given by

$$\mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma Q_t(s', \pi(s')) - Q(s, a)\right)^{2}\right],$$

where $Q_t$ stands for the target Q-value. In off-policy algorithms using experience replay, $(s, a, r, s')$ is the tuple stored in the replay buffer at every environment step, and $a$ is sampled from the experience pool, which is different from the on-policy next action $\pi(s')$. In this context, we use the term 'iteration' to represent the index of updates. In the actor-critic paradigm, one iteration contains the evaluation step and the policy improvement step, which update the Q-value function and then optimize the policy. After the minimization of Bellman errors, the policy improvement is performed by maximizing the expected return Lillicrap et al. (2015)

$$J(\theta) = \mathbb{E}_{s}\left[Q^{\pi}(s, \pi(s))\right].$$

In some algorithms, a policy regularization term may be attached to the expected return to smooth training Kumar et al. (2019); Jaques et al. (2019). These methods focus on constraining the policy gradient $\nabla_{\theta} J(\theta)$ to avoid vanishing or exploding gradients, which in turn reduces the estimation variance.
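The TD objective above can be illustrated with a minimal sketch in plain Python. The function names (`td_target`, `bellman_mse`) and the callables passed in are hypothetical stand-ins for the critic, the target critic and the policy; they are assumptions made for illustration and are not taken from the paper's implementation.

```python
def td_target(reward, next_q, gamma=0.99):
    """One-step TD target r + gamma * Q_t(s', pi(s'))."""
    return reward + gamma * next_q


def bellman_mse(batch, q_fn, target_q_fn, policy_fn, gamma=0.99):
    """Mean squared Bellman error over sampled transitions (s, a, r, s').

    q_fn, target_q_fn and policy_fn are placeholder callables standing in
    for the current critic, the target critic and the policy.
    """
    errors = []
    for s, a, r, s_next in batch:
        # Bootstrap from the target critic at the policy's next action.
        y = td_target(r, target_q_fn(s_next, policy_fn(s_next)), gamma)
        errors.append((y - q_fn(s, a)) ** 2)
    return sum(errors) / len(errors)


# Toy usage: one transition, constant critics, gamma = 0.5.
toy_batch = [(0, 0, 1.0, 1)]
toy_loss = bellman_mse(
    toy_batch,
    q_fn=lambda s, a: 0.0,
    target_q_fn=lambda s, a: 1.0,
    policy_fn=lambda s: 0,
    gamma=0.5,
)
```

In this toy case the target is 1.0 + 0.5 * 1.0 = 1.5 and the current estimate is 0, so the squared error is 2.25.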

4. ADAPTIVE PAIRWISE CRITICS WITH ADAPTIVE ASYMPTOTIC ENTROPY

Value penalty or policy regularization is a common theme in DRL to improve stability. However, it tends to bring more hyperparameters to tune, which makes it harder for the designed algorithm to generalize to more tasks. Therefore, reasonable automatic adjustment of these hyperparameters is important. The approaches to such adaptation vary. For example, PPO adapts the penalty coefficient by setting threshold values for the KL divergence, SAC-v2 Haarnoja et al. (2018b) automatically tunes the temperature hyperparameter by adding a constraint solved through a related dual form, and Meta-SAC transforms the temperature hyperparameters into metaparameters. Our work starts from a policy iteration method that incorporates the adaptive pairwise critics and the entropy estimation. We first justify the adaptive pairwise critics and the adaptive asymptotic entropy and verify the convergence of the corresponding iterations, then organize the related algorithm together with its use of neural networks.

4.1. ADAPTIVE PAIRWISE CRITICS

The iteration of adaptive pairwise critic and adaptive asymptotic actor (APAA) starts by computing the revised target Q-value of a rollout following policy $\pi$, which is combined with a value penalty from the entropy exploration. Given the continuous MDP denoted by $(S, A, p, r)$, let the functions $Q_1, Q_2 : S \times A \to \mathbb{R}$ be the Q-values of two critics. Then a modified Bellman backup operator $T^{\pi}$ is given by

$$T^{\pi} \bar{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}, a_{t+1}}\left[\tilde{Q}(s_{t+1}, a_{t+1})\right], \tag{2}$$

where $s_{t+1} \sim p(\cdot|s_t, a_t)$ and $a_{t+1} \sim \pi(\cdot|s_{t+1})$, and

$$\tilde{Q}(s_t, a_t) = \bar{Q}(s_t, a_t) - \alpha(\Lambda(s_t) + k_t) \log(\pi(a_t|s_t)) \tag{3}$$

is the APAA Q-value function, which can be obtained by repeatedly applying $T^{\pi}$ for any policy $\pi$. Here $\Lambda(s_t)$ is the adaptive random variable (ARV) dependent on the state $s_t$, $0 \le k_t \le 1$ is the asymptotic variable gradually increasing from 0 to 1 as the time step proceeds, and $\alpha$ is the fixed temperature hyperparameter. The sum of the ARV and the asymptotic variable composes the adaptive asymptotic temperature for the entropy. The joint Q-value function $\bar{Q}$ is formulated as

$$\bar{Q}(s_t, a_t) = (1 - \Gamma(s_t)) Q_1(s_t, a_t) + \Gamma(s_t) Q_2(s_t, a_t),$$

where $0 \le \Gamma(\cdot) \le 1$ is the state-dependent adaptive random weight (ARW) that adjusts the influence of the two critics.

Lemma 1. Consider the sequence $\bar{Q}_{k+1} = T^{\pi} \bar{Q}_k$. Given the condition that the Q-values are bounded, i.e., $|Q_1(s, a)| < \infty$, $|Q_2(s, a)| < \infty$, $\forall (s, a) \in S \times A$, the sequence $\bar{Q}_k$ will converge to a unique optimal value as $k \to \infty$.

The proof of Lemma 1 can be found in Appendix A (in Supplementary Files), where it is restated as Lemma 3. However, the sufficient condition is not always satisfied when function approximation is applied. Since the state-action spaces are continuous and the transition probability is unknown in model-free DRL, the Q-value function cannot be formulated or tabulated over the state-action pairs, which means that function approximation gives no absolute guarantee of bounded Q-values. Therefore, instead of repeatedly applying (2) directly as an equality, the practical evaluation step is estimated by minimizing the expected mean square error (MSE) between $T^{\pi} \bar{Q}_k(s, a)$ and $Q_{1,k+1}(s, a)$ or $Q_{2,k+1}(s, a)$. Once the expected MSE converges to zero, the two Q-value functions updated based on (2) will end up with little fluctuation around the same fixed point when the hyperparameters are chosen properly.
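As a rough illustration of how the pairwise critics and the adaptive asymptotic temperature enter the backup, the sketch below evaluates the joint Q-value and the APAA Q-value for scalar inputs. All names (`joint_q`, `apaa_q`, `gamma_w`, `lam`) are hypothetical, and the scalar arguments stand in for network outputs at a single state-action pair; this is a sketch under those assumptions, not the authors' implementation.

```python
def joint_q(q1, q2, gamma_w):
    """Weighted combination (1 - Gamma) * Q1 + Gamma * Q2.

    gamma_w plays the role of the state-dependent weight Gamma(s); it is
    clipped to [0, 1] here so the result stays between the two critics.
    """
    w = min(1.0, max(0.0, gamma_w))
    return (1.0 - w) * q1 + w * q2


def apaa_q(q1, q2, gamma_w, lam, k, alpha, log_prob):
    """Joint Q-value minus the entropy penalty alpha * (Lambda + k) * log pi."""
    return joint_q(q1, q2, gamma_w) - alpha * (lam + k) * log_prob


# Gamma = 0 recovers the first critic, Gamma = 1 the second one.
only_first = joint_q(1.0, 3.0, 0.0)
only_second = joint_q(1.0, 3.0, 1.0)

# With log pi = -1 the entropy term adds alpha * (Lambda + k) to the value.
example = apaa_q(1.0, 3.0, 0.5, lam=0.5, k=0.5, alpha=0.2, log_prob=-1.0)
```

Any weight strictly between 0 and 1 yields a value between the two critics, which is the region between "always pick one critic" strategies that the weight factor is meant to explore.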

5. ADAPTIVE ASYMPTOTIC ENTROPY

In the policy improvement step, the APAA iteration has two purposes: to improve the Q-value at each update, and to project the policy onto a normalized distribution. The policy update step is given by

$$\pi_{new} = \arg\max_{\pi \in \Pi} \mathbb{E}_{s_t, a_t}\left[\bar{Q}(s_t, a_t) - \alpha(\Lambda(s_t) + k_t) \log(\pi(a_t|s_t))\right] = \arg\min_{\pi \in \Pi} \mathbb{E}_{s_t}\left[D_{KL}\left(\pi(\cdot|s_t) \,\Big\|\, \exp\left(\frac{\bar{Q}(s_t, \cdot)}{\alpha\Lambda(s_t) + \alpha k_t}\right)\right)\right],$$

where $s_t \sim S$, $a_t \sim \pi(\cdot|s_t)$, and the choice of policy $\pi$ is limited to a set $\Pi$ of parameterized Gaussian distributions for flexibility. The second equality follows from a short computation, which can be found in Appendix B. The KL divergence $D_{KL}$ shows that the improved policy is updated towards the distribution constituted by the exponential of the normalized Q-value function. The adaptive asymptotic temperature $\Lambda(s_t) + k_t$ does not depend on the action, and thus does not contribute to the policy gradient. However, it still needs to be chosen carefully to better ensure the expected policy improvement. In continuous control problems of model-free DRL, where the transition probability is unknown and the state and action spaces are both continuous, it is not possible to provide policy improvement at every state-action point over $S \times A$, which we call the absolute policy improvement in this paper. Therefore, we propose a practical standard for the policy improvement, which we name the expected policy improvement. It shows that the projected policy in (17) can produce a higher updated Q-value with the expression given by (1), and the result is organized in Lemma 2.

Lemma 2. Denote by $\pi_{old}$ and $\pi_{new}$ the policies before and after the update defined in (17), respectively. Then the expected policy improvement, i.e., $\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi_{new}}(s_t, a_t) - \bar{Q}^{\pi_{old}}(s_t, a_t)\right] \ge 0$, can be guaranteed.

The proof of Lemma 2 can be found in Appendix C, where it is restated as Lemma 4. Besides projecting the policy into a selected set of distributions, (17) also maximizes the expectation of the APAA Q-value function defined in (3) by choosing the specific adaptive asymptotic temperature $\Lambda(s_t) + k_t$, which is the key to guaranteeing the desired expected policy improvement, as shown in the second step of the proof. Furthermore, in discrete control problems, where the state and action spaces are both discrete and bounded, the absolute policy improvement can be realized by removing the expectation over $s \sim S$ in (17). The APAA iteration alternates between the policy evaluation and the expected policy improvement steps, and will converge to the optimal policy, which provides a higher expected Q-value than the other policies in $\Pi$. The result describing the APAA iteration is organized in Theorem 1.

Theorem 1. Let $l_t$ be the learning rate at time step $t$. Given the condition that $0 \le l_t(x) \le 1$, $\sum_t l_t(x) = \infty$, $\sum_t l_t^2(x) < \infty$ w.p.1, repeated application of policy evaluation and expected policy improvement will converge to an optimal policy $\pi^{*} \in \Pi$ such that $\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi^{*}}(s_t, a_t) - \bar{Q}^{\pi}(s_t, a_t)\right] \ge 0$, $\forall \pi \in \Pi$.

Proof. See Appendix D, where the result is restated as Theorem 2.
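The equivalence between the entropy-regularized maximization and the KL projection can be checked numerically in a simplified setting. The sketch below assumes a single state with three discrete actions and an unrestricted policy class with explicit normalization, which differs from the Gaussian policy set over continuous actions used in the paper. The variable `temp` stands for the temperature $\alpha(\Lambda(s) + k)$ at that state, and all names are hypothetical.

```python
import math


def soft_objective(pi, q, temp):
    """E_{a ~ pi}[Q(a) - temp * log pi(a)] for a discrete distribution pi."""
    return sum(p * (qa - temp * math.log(p)) for p, qa in zip(pi, q) if p > 0.0)


def boltzmann(q, temp):
    """Distribution proportional to exp(Q / temp), computed stably."""
    m = max(q)
    weights = [math.exp((qa - m) / temp) for qa in q]
    z = sum(weights)
    return [w / z for w in weights]


q_values = [1.0, 2.0, 0.5]
temp = 0.6  # stands for alpha * (Lambda(s) + k) at a fixed state

pi_star = boltzmann(q_values, temp)
best = soft_objective(pi_star, q_values, temp)

# Two alternative policies: uniform and greedy.
uniform = soft_objective([1 / 3, 1 / 3, 1 / 3], q_values, temp)
greedy = soft_objective([0.0, 1.0, 0.0], q_values, temp)
```

In this discrete case the Boltzmann policy attains the value `temp * log(sum(exp(Q / temp)))`, and neither the uniform nor the greedy policy exceeds it. A larger temperature flattens `pi_star`, which is the exploration effect the adaptive temperature is meant to control.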

5.1. ALGORITHM OF ADAPTIVE PAIRWISE CRITICS WITH ADAPTIVE ASYMPTOTIC ENTROPY

We have discussed above the practical scenario of Theorem 1 in large continuous domains, which requires parameterized function approximations for both the Q-value function and the policy. To stabilize the training process, separate current and target networks are provided for both the Q-value function and the policy. Based on these parameterized networks and (2), the loss function for the update of the critic parameters in the policy evaluation step can be estimated by

$$L(\omega^{i}) = \mathbb{E}_{(s, a, r, s')}\left[\frac{1}{2}\left(r + \gamma Q_t(s', a') - \bar{Q}(s, a)\right)^{2}\right], \tag{7}$$

where $a' = \pi_{\theta'}(s')$ is the action following the target policy parameterized by $\theta'$, and $(s, a, r, s')$ is a tuple of history data sampled from the experience pool. Furthermore,

$$Q_t(s', a') = (1 - \Gamma_{\mu'}(s')) Q_{\omega'^{1}}(s', a') + \Gamma_{\mu'}(s') Q_{\omega'^{2}}(s', a') - \alpha(\Lambda_{\lambda'}(s') + k) \log(\pi_{\theta'}(a'|s')), \tag{8}$$

$$\bar{Q}(s, a) = (1 - \Gamma_{\mu}(s)) Q_{\omega^{1}}(s, a) + \Gamma_{\mu}(s) Q_{\omega^{2}}(s, a), \tag{9}$$

where $\omega^{1}$, $\omega^{2}$, $\omega'^{1}$ and $\omega'^{2}$ parameterize the two critic networks and their target estimates, respectively. In addition, $\Lambda_{\lambda'}$ is the target ARV parameterized by $\lambda'$, $k$ is the asymptotic variable increasing from 0 to 1 as the time step proceeds, the state-dependent ARW $\Gamma$ parameterized by $\mu$ and $\mu'$ is clipped to $[0, 1]$ to determine the influence of the two Q-value functions, and $\pi_{\theta'}(\cdot|s')$ is the target policy distribution conditioned on the next state $s'$. By minimizing (7), the critic parameters can be updated at each policy evaluation step. Then (7) can be optimized with the stochastic gradient

$$\hat{\nabla}_{\omega^{i}} L(\omega^{i}) = \mathbb{E}_{(s, a, r, s')}\left[\hat{\nabla}_{\omega^{i}} \bar{Q}(s, a)\left(\bar{Q}(s, a) - r - \gamma Q_t(s', a')\right)\right] \quad \text{for } i \in \{1, 2\}. \tag{10}$$

Note that two extra networks have been added for the ARV and the ARW. However, to avoid introducing additional saddle point problems, we treat their trainable parameters as part of the actor parameter, i.e., the actor parameter is composed of the policy parameter, the ARV parameter and the ARW parameter. Then the surrogate objective function used to update the current actor parameter $(\theta, \lambda, \mu)$ in the expected policy improvement step (see Lemma 2) is given by

$$J(\theta, \lambda, \mu) = \mathbb{E}_{s}\left[\bar{Q}(s, a) - \alpha(\Lambda_{\lambda}(s) + k) \log(\pi_{\theta}(a|s))\right], \tag{11}$$

where $s$ comes from the tuple of history data, and $a = \pi_{\theta}(s)$ is the reparameterized action based on $s$ and the policy network parameterized by $\theta$. $\Lambda_{\lambda}$ is the current ARV parameterized by $\lambda$, $\Gamma_{\mu}$ is the current ARW parameterized by $\mu$, and $\pi_{\theta}(\cdot|s)$ is the current policy distribution conditioned on the current state $s$. By maximizing (11), the actor parameter can be updated for policy improvement at each step. The gradient of (11) is computed as

$$\hat{\nabla}_{\theta} J(\theta, \lambda, \mu) = \mathbb{E}_{s}\left[\hat{\nabla}_{a} \bar{Q}(s, a) \hat{\nabla}_{\theta} \pi_{\theta}(s) - \frac{\alpha \hat{\nabla}_{\theta} \pi_{\theta}(a|s)(\Lambda_{\lambda}(s) + k)}{\pi_{\theta}(a|s)}\right], \tag{12}$$

$$\hat{\nabla}_{\lambda} J(\theta, \lambda, \mu) = \mathbb{E}_{s}\left[-\alpha \hat{\nabla}_{\lambda} \Lambda_{\lambda}(s) \log(\pi_{\theta}(a|s))\right], \tag{13}$$

$$\hat{\nabla}_{\mu} J(\theta, \lambda, \mu) = \mathbb{E}_{s}\left[\hat{\nabla}_{\mu} \bar{Q}(s, a)\right]. \tag{14}$$

Then the target parameters $(\omega'^{1}, \omega'^{2}, \theta', \lambda', \mu')$ are updated from $(\omega^{1}, \omega^{2}, \theta, \lambda, \mu)$ following the "soft" target updates Lillicrap et al. (2015), in the way of

$$\begin{aligned}
\omega'^{i}_{t+1} &\leftarrow \tau \omega^{i}_{t+1} + (1 - \tau) \omega'^{i}_{t} \quad \text{for } i \in \{1, 2\}, \\
\theta'_{t+1} &\leftarrow \tau_1 \theta_{t+1} + (1 - \tau_1) \theta'_{t}, \\
\lambda'_{t+1} &\leftarrow \tau_1 \lambda_{t+1} + (1 - \tau_1) \lambda'_{t}, \\
\mu'_{t+1} &\leftarrow \tau_1 \mu_{t+1} + (1 - \tau_1) \mu'_{t},
\end{aligned} \tag{15}$$

where $0 \le \tau < 1$ is the factor that controls the speed of policy updates for the sake of a small value error at each iteration, and $0 \le \tau_1 \le 1$ is set to 1 in our application. We organize the above procedures as the adaptive pairwise critics with adaptive asymptotic entropy (APAA) algorithm, whose pseudocode is described in Algorithm 1. The algorithm alternates between running environment steps to collect experience and updating the network parameters using the stochastic gradients computed from batches sampled from the experience pool. In (10), (12), (13) and (14), the gradients are written in their expectation forms. In practice, they are averaged over the sampled tuples, which in off-policy methods usually follow policies parameterized by different parameters. In some algorithms, one gradient step follows one or several environment steps to stabilize the training process.
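The soft target update can be sketched for flat parameter lists as follows. The helper name `soft_update` and the list-of-floats representation are assumptions for illustration; a real implementation would iterate over network tensors instead.

```python
def soft_update(target, current, tau):
    """Return tau * current + (1 - tau) * target, element by element.

    target and current are flat lists of floats standing in for the
    target-network and current-network parameters.
    """
    return [tau * c + (1.0 - tau) * t for t, c in zip(target, current)]


target_params = [0.0, 2.0]
current_params = [1.0, 4.0]

# A factor of 1 copies the current parameters (immediate update).
copied = soft_update(target_params, current_params, tau=1.0)

# A factor of 0.5 moves the target halfway towards the current parameters.
halfway = soft_update(target_params, current_params, tau=0.5)
```

With a factor of 1, as used for the actor-side parameters in the text, the target simply tracks the current network; smaller factors make the target change more slowly.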

6.1. BENCHMARKS

The performance of our proposed method is compared with several prior model-free reinforcement learning algorithms in terms of sample complexity and stability on a set of Gym continuous control tasks from the MuJoCo suite Todorov et al. (2012); Brockman et al. (2016). Fig. 1 illustrates the benchmarks adopted in this paper.

6.2. BASELINES

The adopted baselines include deep deterministic policy gradient (DDPG) Lillicrap et al. (2015), TD3, SAC and BRAC. Before the existence of SAC, DDPG is regarded as one of the most efficient ...

Our proposed algorithm shares the same set of hyperparameters with the other baselines for fairness. Gaussian exploration noise with a fixed variance of 0.2 is added to the action at every time step, then the noisy action is clipped within the set boundary. With the discount horizon factor chosen as 0.99, the algorithms that adopt the entropy term, namely the proposed one, SAC and BRAC, compute it from normal random policies whose mean and variance are parameterized by fully connected networks with two hidden layers, each of which has 256 units. In contrast, both DDPG and TD3 use deterministic policies, also parameterized by fully connected networks with two hidden layers. We organize the network architectures and hyperparameters in Appendix E and F, respectively. The Adam optimizer Kingma & Ba (2014) is used to update the network parameters.

Algorithm 1 APAA Algorithm
1: Input: The update maximum time step $T$
2: Initialize parameters $\omega^{1} \leftarrow \omega^{1}_{0}$, $\omega^{2} \leftarrow \omega^{2}_{0}$, $\theta \leftarrow \theta_{0}$, $\lambda \leftarrow \lambda_{0}$, $\mu \leftarrow \mu_{0}$
3: Initialize target parameters $\omega'^{1} \leftarrow \omega^{1}_{0}$, $\omega'^{2} \leftarrow \omega^{2}_{0}$, $\theta' \leftarrow \theta_{0}$, $\lambda' \leftarrow \lambda_{0}$, $\mu' \leftarrow \mu_{0}$
4: Initialize the learning rates $l_c$, $l_a$ for the critic and the actor, the time step $t \leftarrow 0$, the soft update hyperparameter $\tau$, the maximum time step $T$, the batch size $B$ and the replay buffer $D \leftarrow \emptyset$.
...
11: $\omega^{i}_{t+1} \leftarrow \omega^{i}_{t} - l_c \hat{\nabla}_{\omega^{i}_{t}} L(\omega^{i}_{t})$ for $i \in \{1, 2\}$ following Eq. (10)
12: $\theta_{t+1} \leftarrow \theta_{t} + l_a \hat{\nabla}_{\theta_{t}} J(\theta_{t}, \lambda_{t}, \mu_{t})$ following Eq. (12)
13: $\lambda_{t+1} \leftarrow \lambda_{t} + l_a \hat{\nabla}_{\lambda_{t}} J(\theta_{t}, \lambda_{t}, \mu_{t})$ following Eq. (13)
14: $\mu_{t+1} \leftarrow \mu_{t} + l_a \hat{\nabla}_{\mu_{t}} J(\theta_{t}, \lambda_{t}, \mu_{t})$ following Eq. (14)
15: $\omega'^{i}_{t+1} \leftarrow \tau \omega^{i}_{t+1} + (1 - \tau) \omega'^{i}_{t}$ for $i \in \{1, 2\}$

6.3. RESULTS

We train 10 seeds for each algorithm to keep the comparison fair. After every 500 iterations (time steps), we launch an evaluation procedure, which averages 10 rollouts for a test. The average reward of a test is recorded at every evaluation procedure, and all tests throughout the time step scale give the result of each algorithm. The average rewards of the algorithms tested on the chosen benchmarks are shown in Fig. 2 with a 95% confidence interval (CI). From Figs. 5(a), 5(b) and 5(c), we can observe an overwhelming advantage of APAA over the other baselines. In the Hopper environment, since the converged value is far lower than in the other benchmarks, the tolerance for fluctuation around convergence is much lower, which causes [...] variables, the best performance cannot be ensured for every seed, which means that the potentially reduced stability and convergence (partly shown by Fig. 2) are reasonable. Since SAC has a variant that automatically adjusts the temperature hyperparameter Haarnoja et al. (2018b), we use SAC-t to represent it and compare its performance with APAA in Fig. 3 with a 95% CI. SAC-t adds an extra hyperparameter H as the target entropy in exchange for the temperature, which may not lead to better performance because the target entropy cannot be generalized and also needs automatic tuning, as reported by Wu et al. (2019). According to Fig. 3, SAC-t fails to produce better performance than APAA when the target entropy is chosen as 0.5, which supports the way the temperature is adjusted in APAA. Due to the page limit, we present the comparison of value estimates in Appendix G.
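For readers who want to reproduce curves of this kind, the sketch below computes a mean and a 95% confidence interval across seeds at one evaluation point. The normal approximation with the factor 1.96 is an assumption; the text does not state how the intervals in the figures were computed, and the function name `mean_ci95` and the sample numbers are hypothetical.

```python
import statistics


def mean_ci95(values):
    """Mean and normal-approximation 95% CI of per-seed average rewards.

    Assumes at least two values. The 1.96 factor is the usual normal
    quantile; a Student-t quantile would be wider for small samples.
    """
    n = len(values)
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / n ** 0.5
    return mean, mean - half_width, mean + half_width


# Hypothetical average rewards of 10 seeds at one evaluation point.
seed_rewards = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0]
center, lower, upper = mean_ci95(seed_rewards)
```

Applying `mean_ci95` at every evaluation point (every 500 time steps in the setup above) gives the center line and the shaded band of a learning curve.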

7. CONCLUSION

In this paper, we proposed a state-dependent adaptive temperature to encourage policy exploration, which strikes a better balance between efficiency and stability by introducing an asymptotic maximum entropy. The asymptotic maximum entropy is then combined with the adaptive pairwise critics to benefit the policy evaluation and improvement steps. Based on these two components, we present APAA to achieve a better tradeoff between efficiency and stability. We evaluate our method on a set of Gym tasks, and the results show that the proposed algorithm performs better than several baselines on continuous control.

A PROOF OF LEMMA 3

Lemma 3. Consider the sequence $\bar{Q}_{k+1} = T^{\pi} \bar{Q}_k$. Given the condition that the Q-values are bounded, i.e., $|Q_1(s, a)| < \infty$, $|Q_2(s, a)| < \infty$, $\forall (s, a) \in S \times A$, the sequence $\bar{Q}_k$ will converge to a unique optimal value as $k \to \infty$.

Proof.

$$\begin{aligned}
\left|T^{\pi} \bar{Q}(s_t, a_t) - T^{\pi} \bar{Q}'(s_t, a_t)\right| &= \gamma \left|\mathbb{E}_{s_{t+1} \sim p, a_{t+1} \sim \pi}\left[\bar{Q}(s_{t+1}, a_{t+1}) - \bar{Q}'(s_{t+1}, a_{t+1})\right]\right| \\
&\le \gamma \mathbb{E}_{s_{t+1} \sim p, a_{t+1} \sim \pi}\left[\left|\bar{Q}(s_{t+1}, a_{t+1}) - \bar{Q}'(s_{t+1}, a_{t+1})\right|\right] \\
&\le \gamma \max_{s_{t+1} \sim p, a_{t+1} \sim \pi} \left|\bar{Q}(s_{t+1}, a_{t+1}) - \bar{Q}'(s_{t+1}, a_{t+1})\right| \\
&= \gamma \left\|\bar{Q} - \bar{Q}'\right\|_{\infty},
\end{aligned} \tag{16}$$

where $\|\cdot\|_{\infty}$ denotes the max norm, and $p$ and $\pi$ are short for $p(\cdot|s_t, a_t)$ and $\pi(\cdot|s_{t+1})$. Since the Q-values are assumed to be bounded, the adaptive pairwise critic $\bar{Q}$ is also bounded, so the third inequality holds. We conclude that (16) holds $\forall (s_t, a_t) \in S \times A$, which can be rewritten as the max-norm contraction mapping $\left\|T^{\pi} \bar{Q} - T^{\pi} \bar{Q}'\right\|_{\infty} \le \gamma \left\|\bar{Q} - \bar{Q}'\right\|_{\infty}$. According to the property of contraction operators, the sequence $\bar{Q}_{k+1} = T^{\pi} \bar{Q}_k$ will converge to its fixed point.
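The contraction property used in this proof can be observed numerically on a toy problem. The sketch below assumes a three-state deterministic chain under a fixed policy, with one value per state and no entropy term (the entropy term cancels in the difference of two backups). The chain, the rewards and the function names are hypothetical and only illustrate the max-norm argument.

```python
def backup(q, rewards, next_state, gamma):
    """One Bellman-style backup on a deterministic chain with a fixed policy."""
    return [rewards[s] + gamma * q[next_state[s]] for s in range(len(q))]


def sup_norm(u, v):
    """Max-norm distance between two value tables."""
    return max(abs(a - b) for a, b in zip(u, v))


gamma = 0.9
rewards = [1.0, 0.0, 2.0]
next_state = [1, 2, 0]  # state 0 -> 1 -> 2 -> 0

q_a = [0.0, 0.0, 0.0]
q_b = [5.0, -3.0, 10.0]

gaps = []
for _ in range(50):
    gaps.append(sup_norm(q_a, q_b))
    q_a = backup(q_a, rewards, next_state, gamma)
    q_b = backup(q_b, rewards, next_state, gamma)
```

Each entry of `gaps` is at most `gamma` times the previous one, so two different initial tables are driven to the same fixed point, which is the behavior the contraction argument predicts.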
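B PROOF OF (17)

$$\pi_{new} = \arg\max_{\pi \in \Pi} \mathbb{E}_{s_t, a_t}\left[\bar{Q}(s_t, a_t) - \alpha(\Lambda(s_t) + k_t) \log(\pi(a_t|s_t))\right] = \arg\min_{\pi \in \Pi} \mathbb{E}_{s_t}\left[D_{KL}\left(\pi(\cdot|s_t) \,\Big\|\, \exp\left(\frac{\bar{Q}(s_t, \cdot)}{\alpha\Lambda(s_t) + \alpha k_t}\right)\right)\right] \tag{17}$$

Proof.

$$\begin{aligned}
\pi_{new} &= \arg\max_{\pi \in \Pi} \mathbb{E}_{s_t \sim S, a_t \sim \pi(\cdot|s_t)}\left[\bar{Q}(s_t, a_t) - \alpha(\Lambda(s_t) + k_t) \log(\pi(a_t|s_t))\right] \\
&= \arg\min_{\pi \in \Pi} \mathbb{E}_{s_t \sim S, a_t \sim \pi(\cdot|s_t)}\left[\log(\pi(a_t|s_t)) - \frac{\bar{Q}(s_t, a_t)}{\alpha(\Lambda(s_t) + k_t)}\right] \\
&= \arg\min_{\pi \in \Pi} \mathbb{E}_{s_t \sim S, a_t \sim A}\left[\pi(a_t|s_t)\left(\log(\pi(a_t|s_t)) - \frac{\bar{Q}(s_t, a_t)}{\alpha(\Lambda(s_t) + k_t)}\right)\right] \\
&= \arg\min_{\pi \in \Pi} \mathbb{E}_{s_t \sim S}\left[D_{KL}\left(\pi(\cdot|s_t) \,\Big\|\, \exp\left(\frac{\bar{Q}(s_t, \cdot)}{\alpha(\Lambda(s_t) + k_t)}\right)\right)\right],
\end{aligned}$$

where the second equality holds because $\Lambda(s_t)$ and $k_t$ do not depend on $a_t$, and $D_{KL}(\cdot \| \cdot)$ is the KL divergence.

C PROOF OF LEMMA 4

Lemma 4. Denote by $\pi_{old}$ and $\pi_{new}$ the policies before and after the update defined in (17), respectively. Then the expected policy improvement, i.e., $\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi_{new}}(s_t, a_t) - \bar{Q}^{\pi_{old}}(s_t, a_t)\right] \ge 0$, can be guaranteed.

Proof.

$$\mathrm{var}\left(T^{\pi_{new}} \bar{Q}^{\pi_{old}} - \bar{Q}^{\pi_{new}}\right) = \mathbb{E}\left[\left(T^{\pi_{new}} \bar{Q}^{\pi_{old}} - \bar{Q}^{\pi_{new}}\right)^{2}\right] - \left(\mathbb{E}\left[T^{\pi_{new}} \bar{Q}^{\pi_{old}} - \bar{Q}^{\pi_{new}}\right]\right)^{2} \ge 0, \tag{19}$$

where $\mathrm{var}(\cdot)$ represents the variance. According to (19), we have

$$\left(\mathbb{E}\left[T^{\pi_{new}} \bar{Q}^{\pi_{old}} - \bar{Q}^{\pi_{new}}\right]\right)^{2} \le \mathbb{E}\left[\left(T^{\pi_{new}} \bar{Q}^{\pi_{old}} - \bar{Q}^{\pi_{new}}\right)^{2}\right],$$

then $\mathbb{E}[T^{\pi_{new}} \bar{Q}^{\pi_{old}}]$ will converge to $\mathbb{E}[\bar{Q}^{\pi_{new}}]$ based on the expected MSE analyzed in the paragraph after Lemma 1 in Section 4.1.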
This constitutes the first step of the proof of the expected policy improvement, which can be written as

$$\begin{aligned}
\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi_{new}}(s_t, a_t)\right] &= \mathbb{E}_{(s_t, a_t) \sim S \times A}\left[T^{\pi_{new}} \bar{Q}^{\pi_{old}}(s_t, a_t)\right] \\
&\ge \mathbb{E}_{(s_t, a_t) \sim S \times A}\left[T^{\pi_{old}} \bar{Q}^{\pi_{old}}(s_t, a_t)\right] \\
&= \mathbb{E}_{(s_t, a_t) \sim S \times A}\left[r(s_t, a_t)\right] + \gamma \mathbb{E}_{s_{t+1} \sim p(\cdot|s_t, a_t), a_{t+1} \sim \pi(\cdot|s_{t+1})}\left[\tilde{Q}^{\pi_{old}}(s_{t+1}, a_{t+1})\right] \\
&\ge \mathbb{E}_{(s_t, a_t) \sim S \times A}\left[r(s_t, a_t)\right] + \gamma \mathbb{E}_{s_{t+1} \sim p(\cdot|s_t, a_t), a_{t+1} \sim \pi(\cdot|s_{t+1})}\left[\bar{Q}^{\pi_{old}}(s_{t+1}, a_{t+1})\right] \\
&= \mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi_{old}}(s_t, a_t)\right],
\end{aligned} \tag{21}$$

where the second inequality holds because of the update rule following the first equality of (17), the third equality is the expected form of the modified Bellman backup operator, the fourth inequality holds because both the entropy and the adaptive asymptotic temperature are nonnegative, and the last equality is a variant of the Bellman equation. Because of the unknown transition probability and the continuous state-action spaces in continuous model-free DRL, the Q-value function is usually approximated by neural networks, which makes it impossible to directly apply the Bellman equation to every state-action pair over $S \times A$. Under this circumstance, the Bellman equation only holds in a statistical sense.

D PROOF OF THEOREM 2

Theorem 2. Let $l_t$ be the learning rate at time step $t$. Given the condition that

$$0 \le l_t(x) \le 1, \quad \sum_t l_t(x) = \infty, \quad \sum_t l_t^{2}(x) < \infty \quad \text{w.p.1}, \tag{22}$$

repeated application of policy evaluation and expected policy improvement will converge to an optimal policy $\pi^{*} \in \Pi$ such that $\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi^{*}}(s_t, a_t) - \bar{Q}^{\pi}(s_t, a_t)\right] \ge 0$, $\forall \pi \in \Pi$.

Proof. According to Lemma 1 of Singh et al. (2000), the condition (22) makes the expected MSE of the temporal difference (TD) converge to zero, which validates the policy evaluation of the adaptive asymptotic iteration and proves the first step of (21) to be true. With the monotonic increase of the updated expected Q-value, the converged optimal policy will render $\mathbb{E}_{(s_t, a_t) \sim S \times A}\left[\bar{Q}^{\pi^{*}}(s_t, a_t) - \bar{Q}^{\pi}(s_t, a_t)\right] \ge 0$, $\forall \pi \in \Pi$.

E NETWORK ARCHITECTURE

We construct the critic network using a fully-connected MLP with two hidden layers. The input is composed of the state and action, and the output is a value representing the Q-value. ReLU functions are adopted to activate the two hidden layers. The policy network follows a normal random distribution, whose expectation and variance are fully-connected networks fed only by the state. Both of them have two hidden layers activated by the ReLU function. After the hidden layers, a Tanh function and a Softplus function follow to form the expectation and variance, respectively. With the expectation and variance, a normal distribution is obtained to represent the random policy. The networks of the state-dependent ARV $\Lambda$ and ARW $\Gamma$ are constructed similarly to the expectation and variance networks of the policy, except that the last nonlinearity is replaced by a Sigmoid function. The architecture of the networks is plotted in Fig. 4. The above network architecture is adopted for the random policy. For the algorithms using a deterministic policy, the critic is constructed in the same way, while the actor network is the same as the expectation network of the normal random distribution.

F HYPERPARAMETERS

Table 1 lists the common hyperparameters shared by all experiments and their respective settings. In this table, LR a means the learning rate of the actor (including lambda in our proposed algorithm), and LR c means the learning rate of the critics. $\tau_a$ and $\tau_c$ represent the soft update hyperparameters of the actor and the critic, respectively, and $\tau_a = 1$ means we adopt an immediate update for the actor. The symbol var represents the variance of the Gaussian exploration noise, and $\alpha$ is the fixed temperature hyperparameter, which is applied in all algorithms except DDPG and TD3. $\alpha_d$ represents the weight factor of the KL divergence for policy regularization applied in BRAC, and $\beta$ is the asymptotic rise 



Figure 1: (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3

Figure 2: Average reward versus time step in (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3; (e) Humanoid-v3

Figure 4: Architecture of networks.

Figure 5: Comparison between the value estimate and the true value in (a) Ant-v3; (b) Halfcheetah-v3; (c) Hopper-v3; (d) Walker2d-v3; (e) Humanoid-v3. 'Q True' means the true value and 'Q est' means the value estimate.

Select action $a_t \sim \pi_{\theta_t}(a_t|s_t)$
Observe the reward and next state $s_{t+1}, r_t \sim p(s_{t+1}|s_t, a_t)$

G VALUE ESTIMATE

We plot the value estimate, approximated by the trained Q-value networks, over time steps to compare it with the true value, which is represented by the discounted return of rollouts starting from 1000 random state-action pairs from the replay buffer. The discounted return of a rollout is recorded every 500 time steps; it follows the updated policy at that time step and is different from the average return. The differences between the value estimates and the true values are illustrated in Fig. 5. From these figures, we can observe that the algorithm without tuning the target Q-value (DDPG) suffers from great overestimation, whereas simply choosing the smaller Q-value from a pair of critics (TD3 and SAC) brings non-negligible underestimation instead. Since inaccurate value estimates lead to poor policy updates, neither underestimation nor overestimation is wanted. The dynamic adjustment of the target Q-value through the ARW in APAA offers a way to balance the two.
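The "true value" used in this comparison is a discounted return, which can be sketched as below together with a simple bias measure. The function names and the toy numbers are hypothetical; the rollout rewards would in practice come from running the updated policy from the sampled state-action pairs.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence, accumulated from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimation_bias(q_estimates, true_values):
    """Mean of (estimate - true value); positive means overestimation."""
    diffs = [q - v for q, v in zip(q_estimates, true_values)]
    return sum(diffs) / len(diffs)


# Toy rollout: three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
toy_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# A critic that outputs 2.0 for this pair overestimates by 0.25.
toy_bias = estimation_bias([2.0], [toy_return])
```

A negative value of `estimation_bias` would correspond to the underestimation described for the minimum-of-two-critics targets.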

