MQES: MAX-Q ENTROPY SEARCH FOR EFFICIENT EXPLORATION IN CONTINUOUS REINFORCEMENT LEARNING

Abstract

The principle of optimism in the face of (aleatoric and epistemic) uncertainty has been used to design efficient exploration strategies for Reinforcement Learning (RL). Different from most prior work, which targets discrete action spaces, we propose a general information-theoretic exploration principle called Max-Q Entropy Search (MQES) for continuous RL algorithms. MQES formulates the exploration policy to maximize the information about the distribution of the globally optimal Q function, which allows it to explore optimistically and to avoid over-exploration by accounting for the epistemic and the aleatoric uncertainty, respectively. To make MQES practically tractable, we first incorporate distributional and ensemble Q function approximations into MQES, which formulate the epistemic and aleatoric uncertainty accordingly. Then, we introduce a constraint to stabilize training and solve the constrained MQES problem to derive the exploration policy in closed form. Empirical evaluations show that MQES outperforms state-of-the-art algorithms on Mujoco environments.

1. INTRODUCTION

In Reinforcement Learning (RL), one of the fundamental problems is the exploration-exploitation dilemma: should the agent explore states about which it has imperfect knowledge to improve future reward, or instead maximize the immediate reward at states it already understands well? The main obstacle to designing efficient exploration strategies is deciding whether unexplored states lead to high cumulative reward. Popular exploration strategies, such as ε-greedy (Sutton & Barto, 1998) and sampling from a stochastic policy (Haarnoja et al., 2018), lead to undirected exploration through additional random perturbations. Recently, the uncertainty of the system has been introduced to guide exploration (Kirschner & Krause, 2018; Mavrin et al., 2019; Clements et al., 2019; Ciosek et al., 2019). As Moerland et al. (2017) point out, two sources of uncertainty exist in an RL system: epistemic and aleatoric uncertainty. Epistemic uncertainty, also called parametric uncertainty, is the ambiguity of models arising from imperfect knowledge of the environment, and it can be reduced with more data. Aleatoric uncertainty is the intrinsic variation associated with the environment; it is caused by the randomness of the environment and is not affected by the model. In an RL system, states that are seldom visited have relatively large epistemic uncertainty, so exploration methods should encourage exploration where epistemic uncertainty is large. Moreover, heteroscedastic aleatoric uncertainty means that different states may exhibit different randomness, which yields different aleatoric uncertainty. If we do not distinguish these two uncertainties and formulate them separately, we may explore states that are visited frequently but highly random, i.e., states with low epistemic but high aleatoric uncertainty, which is undesirable.
By introducing uncertainty, exploration objectives such as Thompson Sampling (TS) (Thompson, 1933; Osband et al., 2016) and the Upper Confidence Bound (UCB) (Auer, 2002; Mavrin et al., 2019; Chen et al., 2017) have been used to guide exploration in RL. However, since the aleatoric uncertainty in RL systems is heteroscedastic, i.e., it depends on states and actions and can differ among them, these methods are not efficient. Hence, Nikolov et al. (2019) propose a novel exploration objective called Information-Directed Sampling (IDS) that accounts for epistemic uncertainty and heteroscedastic aleatoric uncertainty. However, these methods (Nikolov et al., 2019; Mavrin et al., 2019; Chen et al., 2017; Osband et al., 2016) can only be applied in environments with discrete action spaces. In this paper, we propose a general information-theoretic principle called Max-Q Entropy Search (MQES) for off-policy continuous RL algorithms. Further, as an application example of MQES, we combine distributional RL with the soft actor-critic method, where the epistemic and aleatoric uncertainty are formulated accordingly. Then, we incorporate MQES into the Distributional Soft Actor-Critic (DSAC) (Ma et al., 2020) method and show how MQES utilizes both uncertainties to explore. Finally, our results on Mujoco environments show that our method can substantially outperform alternative state-of-the-art algorithms.

2. RELATED WORK

Efficient exploration can improve the efficiency and performance of RL algorithms. With the increasing emphasis on exploration efficiency, various exploration methods have been developed. One family of methods uses intrinsic motivation to stimulate the agent to explore from different perspectives, such as count-based novelty (Martin et al., 2017; Ostrovski et al., 2017; Bellemare et al., 2016; Tang et al., 2017; Fox et al., 2018), prediction error (Pathak et al., 2017), reachability (Savinov et al., 2019) and information gain on environment dynamics (Houthooft et al., 2016). Some recently proposed methods in deep RL, originating from tracking uncertainty, perform efficient exploration under the principle of OFU (optimism in the face of uncertainty), such as Thompson Sampling (Thompson, 1933; Osband et al., 2016), IDS (Nikolov et al., 2019; Clements et al., 2019) and other customized methods (Moerland et al., 2017; Pathak et al., 2019).
Methods for tracking uncertainty. Bootstrapped DQN (Osband et al., 2016) combines Thompson sampling with value-based RL algorithms. It is similar to PSRL (Strens, 2000; Osband et al., 2013) and leverages the uncertainty of the value estimates for deep exploration. Bootstrapped DQN has become a common baseline for many recent works and a widely used approach for capturing epistemic uncertainty (Kirschner & Krause, 2018; Ciosek et al., 2019). However, it takes only epistemic uncertainty into account. Distributional RL approximates the return distribution directly, as in Categorical DQN (C51) (Bellemare et al., 2017), QR-DQN (Dabney et al., 2018b) and IQN (Dabney et al., 2018a). The return distribution can be used to approximate aleatoric uncertainty, but those methods do not take advantage of it for exploration.
Exploration with two types of uncertainty.
Traditional OFU methods either focus only on epistemic uncertainty or treat the two kinds of uncertainty as a whole, which can easily lead a naive solution to favor actions with higher variance. To address this, Mavrin et al. (2019) study how to take advantage of the distributions learned by distributional RL methods for efficient exploration under both kinds of uncertainty, proposing Decaying Left Truncated Variance (DLTV). Nikolov et al. (2019) and Clements et al. (2019) propose to use Information-Directed Sampling (Kirschner & Krause, 2018) for efficient exploration in RL (IDS for RL); they estimate both kinds of uncertainty and use IDS to select actions. We follow the practice of uncertainty estimation in Clements et al. (2019), as shown in Sec. 4.2.1. IDS integrates both uncertainties and has made progress on exploration, but it is limited to discrete action spaces. In this paper, we focus on how best to exploit both uncertainties for efficient exploration in continuous action spaces.
Optimistic Actor-Critic. More closely related to our work is OAC (Ciosek et al., 2019), which uses epistemic uncertainty to build an upper bound Q_UB on the Q estimate. OAC is based on Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and additionally proposes an exploration bonus to facilitate exploration. Despite the advantages OAC achieves over SAC, it does not consider the potential impact of aleatoric uncertainty, which may mislead exploration.

3.1. DISTRIBUTIONAL RL

Distributional RL methods study return distributions rather than point estimates, which introduces aleatoric uncertainty from a distributional perspective. There are different approaches to representing distributions in RL. In this paper, we focus on the quantile regression used in QR-DQN (Dabney et al., 2018b), where the randomness of the state-action value is represented by the quantile random variable Z with value z. Z maps a state-action pair to a uniform probability distribution supported on {z_i}, where z_i is the value of the corresponding quantile estimate. If τ_i is the quantile fraction, the cumulative probability of such a quantile distribution is F_Z(z_i) = Pr(Z < z_i) = τ_i = i/N for i ∈ {1, ..., N}. Similar to the Bellman operator in traditional Q-Learning (Watkins & Dayan, 1992), the distributional Bellman operator T^π_D under policy π is given as

T^π_D Z(s_t, a_t) =_D R(s_t, a_t) + γ Z(s_{t+1}, a_{t+1}),  a_{t+1} ∼ π(·|s_{t+1}).

Notice that this operator acts on random variables; =_D denotes that the distributions on both sides have equal probability laws. Based on the distributional Bellman operator, Dabney et al. (2018b) propose QR-DQN, which trains the quantile estimates via the quantile regression loss

L_QR(θ) = (1/N) Σ_{i=1}^N Σ_{j=1}^N ρ_{τ̂_i}(δ_{i,j}),

where δ_{i,j} = R(s_t, a_t) + γ z_j(s_{t+1}, a_{t+1}; θ) − z_i(s_t, a_t; θ), ρ_τ(u) = u(τ − 1_{u<0}), and τ̂_i denotes the quantile midpoint, defined as τ̂_i = (τ_{i−1} + τ_i)/2.
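The quantile regression loss above can be sketched in a few lines of plain Python; the function names and toy inputs are ours, and a production implementation (e.g., QR-DQN's Huber-smoothed variant) would operate on network outputs instead:

```python
def rho(u, tau):
    # Quantile regression check function: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (1.0 if u < 0.0 else 0.0))

def qr_loss(z_pred, z_target_next, reward, gamma=0.99):
    """L_QR = (1/N) * sum_i sum_j rho_{tau_hat_i}(delta_ij), with
    delta_ij = r + gamma * z_j(s', a') - z_i(s, a)."""
    n = len(z_pred)
    tau_hat = [(i + 0.5) / n for i in range(n)]  # quantile midpoints
    total = 0.0
    for i in range(n):
        for j in range(n):
            delta = reward + gamma * z_target_next[j] - z_pred[i]
            total += rho(delta, tau_hat[i])
    return total / n
```

Note that both over- and under-estimates incur a non-negative penalty, weighted asymmetrically by the quantile fraction τ̂_i, which is what pushes each z_i toward its own quantile of the target distribution.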

3.2. DISTRIBUTIONAL SOFT ACTOR-CRITIC METHODS

Following Ma et al. (2020), distributional RL has been successfully integrated with the Soft Actor-Critic (SAC) algorithm. Under the maximum entropy RL framework, the distributional soft Bellman operator T^π_DS is defined as

T^π_DS Z(s_t, a_t) =_D R(s_t, a_t) + γ[Z(s_{t+1}, a_{t+1}) − α log π(a_{t+1}|s_{t+1})],

where a_{t+1} ∼ π(·|s_{t+1}) and s_{t+1} ∼ P(·|s_t, a_t). The quantile regression loss in DSAC differs from that of the original QR-DQN only in δ_{i,j}, by accounting for the maximum entropy term. DSAC extends the clipped double Q-Learning proposed in TD3 (Fujimoto et al., 2018) to overcome the overestimation problem. Two quantile-regression Q networks with the same structure are parameterized by θ_k, k = 1, 2. Following clipped double Q-Learning, the TD error of DSAC is defined as

y^t_i = min_{k=1,2} z_i(s_{t+1}, a_{t+1}; θ̄_k),   (4)
δ^k_{i,j} = R(s_t, a_t) + γ[y^t_i − α log π(a_{t+1}|s_{t+1}; φ̄)] − z_j(s_t, a_t; θ_k),

where θ̄ and φ̄ denote the respective target networks. DSAC modifies the critic, while the update of the actor is unaffected. It is worth noticing that the state-action value is the minimum over the expectations of the two distributions:

Q(s_t, a_t; θ) = min_{k=1,2} Q(s_t, a_t; θ_k) = min_{k=1,2} (1/N) Σ_{i=0}^{N−1} z_i(s_t, a_t; θ_k).

Thus, in DSAC, the actor minimizes the following objective:

J_π(φ) = E_{s_t∼D, ε∼N}[α log π(f(s_t, ε_t; φ)|s_t) − Q(s_t, f(s_t, ε_t; φ); θ)],

where D is the replay buffer and f(s_t, ε_t; φ) samples an action from the reparameterized policy.
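As a concrete illustration of the clipped double quantile target in Eq. 4, the per-quantile TD target can be sketched as follows (a minimal sketch with hypothetical inputs, not the authors' implementation):

```python
def dsac_quantile_targets(z1_next, z2_next, log_pi_next, reward,
                          gamma=0.99, alpha=0.2):
    """Clipped double Q-learning on quantiles:
    y_i = min_k z_i(s', a'; theta_bar_k), then the target for quantile i is
    r + gamma * (y_i - alpha * log pi(a'|s'))."""
    return [reward + gamma * (min(za, zb) - alpha * log_pi_next)
            for za, zb in zip(z1_next, z2_next)]
```

Taking the minimum quantile-wise (rather than on the means alone) keeps the full pessimistic distribution available for the uncertainty estimates used later in Sec. 4.2.1.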

4. ALGORITHM

This paper proposes a new exploration principle for continuous RL algorithms, MQES, which leverages epistemic and aleatoric uncertainty to explore optimistically and to avoid over-exploration. To make MQES practically tractable, distributional and ensemble Q function approximations are introduced to formulate the epistemic and aleatoric uncertainty accordingly. Furthermore, a constraint is introduced into MQES to stabilize training, and the approximate exploration policy is derived in closed form. All these mechanisms are detailed in the following sections.

4.1. EXPLORATION STRATEGY: MAX-Q ENTROPY SEARCH

To achieve better exploration, MQES derives an exploration policy π_E that aims to reduce epistemic uncertainty and gain more knowledge of the globally optimal Q function. First, we define the exploration action random variable A_E ∼ π_E(a|s) with value a_E ∈ A, where A is the action space. Z*(s, a*) is the random variable whose distribution describes the randomness of the return obtained by the globally optimal policy π*, and its value is denoted z*(s, a*). By maximizing the mutual information between the random variables Z*(s, a*) and A_E, we reduce the uncertainty about the globally optimal Q function Q* to encourage exploration. Specifically, at timestep t, we find the exploration policy π_E in a candidate distribution family Π that maximizes the information about Z*(s, a*):

π_E = arg max_{π∈Π} F^π(s_t),

where F^π is the mutual information and can be written as

F^π(s_t) = MI(Z*(s, a*), A | s = s_t) = H[π(a_t|s_t)] − E_{z*} H[p(a_t | z*(s_t, a*), s_t)].

Here MI(·) and H(·) denote the mutual information and the entropy of a random variable, respectively. To obtain the exploration policy π_E, we need to evaluate the posterior probability p(·) in the above equation. For simplicity, we omit the timestep t in the following. To evaluate the posterior probability p(a | z*(s, a*), s), we propose the following proposition.
Proposition 1. The posterior probability can be estimated as

p(a | z*(s, a*), s) ∝ π(a|s) Φ_{Z^π(s,a)}(z*(s, a*)),

where Φ_X is the cumulative distribution function (CDF) of X, Z* and Z^{π_E} are the random variables whose distributions describe the randomness of the returns obtained by the optimal policy π* and the exploration policy π_E, respectively, and z* is the value of the random variable Z* (see proof in Appendix A).
Since the distribution of Z* is intractable during training, we use Ẑ* as an approximation (i.e., Ẑ* ≈ Z*), which will be defined later. In general, Ẑ* is referred to as the optimistic approximator (Mavrin et al., 2019; Chen et al., 2017) and can be formulated using the uncertainty estimates detailed in Sec. 4.2.1. Therefore, F^π(s) in Eq. 9 can be estimated as

F^π(s) ≈ F̂^π(s) = E_{π, Ẑ*}[log π(a|s)(G(s, a) − 1) + G(s, a) log G(s, a)],

where G(s, a) = (1/C) Φ_{Z^π(s,a)}(ẑ*(s, a)) and C is a normalization factor. Intuitively, G(s, a) measures the difference between Z^π and Ẑ*: a large CDF value means that ẑ* is much larger than the mean of Z^π. By introducing distributional value functions in Eq. 10 to estimate the posterior probability, we can use the uncertainty of the value function to encourage exploration, as discussed in Sec. 4.2.2.

4.2. MQES-BASED EXPLORATION FOR MODERN RL ALGORITHMS

In this section, we propose a scheme to incorporate the exploration policy derived from MQES into existing policy-based algorithms, e.g., SAC and TD3, which yields stable, well-performing algorithms with more efficient exploration. First, to obtain the exploration policy, we impose a constraint that keeps the difference between the exploration and target policies within a certain range, i.e., KL(π||π_T) ≤ α. By the target policy π_T we mean the policy learned by the underlying policy-based algorithm. It is worth noting that MQES introduces distributional and ensemble critics into the existing framework (e.g., introducing a distributional critic into SAC yields DSAC). Moreover, we use the critic of the target policy to formulate Ẑ* and Z^{π_E}, as stated later. Intuitively, the constraint in MQES ensures that the critic of the target policy guides the exploration properly and stabilizes training. Otherwise, exploration could be ineffective and the update of the target policy could degrade badly: if π and π_T differ significantly, Z^{π_T} cannot criticize the exploration policy properly, so the agent may explore under wrong guidance, the experiences stored in the replay buffer are of poor quality, and the update of the target policy subsequently fails. Second, with the KL constraint, the MQES-based exploration problem for modern RL algorithms is given as

π_E(a|s) = arg max_π F̂^π(s),  s.t. KL(π||π_T) ≤ α,   (12)

where both the exploration policy π_E = N(μ_E, Σ_E) and the target policy π_T = N(μ_T, Σ_T) are Gaussian. By expanding F̂^π linearly, we solve the problem in Eq. 12 using the following proposition.
Proposition 2. The MQES exploration policy π_E = N(μ_E, Σ_E) derived from Eq. 12 is

μ_E = μ_T + (√(2α) / ||E_{Ẑ*}[m ⊙ ∂Ẑ*(s,a)/∂a |_{a=μ_T}]||_{Σ_E}) Σ_E E_{Ẑ*}[m ⊙ ∂Ẑ*(s,a)/∂a |_{a=μ_T}],  Σ_E = Σ_T.   (13)
Specifically, the i-th element of the vector m is m_i = log G(s, μ_T) / √((2π)Σ_ii) + 1, with G(s, μ_T) = Φ_{Z^{π_E}(s, μ_T)}(Ẑ*(s, μ_T)) / C, for i ∈ {1, ..., n}, where n is the action dimension (see proof in Appendix B). The expectation over Ẑ* can be estimated by sampling, which gives the following estimate of Eq. 13:

μ_E = μ_T + (√(2α)/K) Σ_{i=1}^K (1 / ||m ⊙ ∂ẑ*_i(s,a)/∂a |_{a=μ_T}||_{Σ_E}) Σ_E [m ⊙ ∂ẑ*_i(s,a)/∂a |_{a=μ_T}].   (14)
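For a diagonal Σ_E, the sampled estimate of Eq. 14 can be sketched as below; the sampled gradients of ẑ* and all numeric values are hypothetical placeholders for what the critic would supply:

```python
import math

def mqes_mean(mu_T, Sigma_diag, m, grad_samples, alpha=0.1):
    """Sketch of Eq. 14 with diagonal Sigma_E: for each sampled gradient g_i
    of z*_i(s, a) at a = mu_T, add
    sqrt(2*alpha)/K * Sigma_E (m ⊙ g_i) / ||m ⊙ g_i||_{Sigma_E}."""
    K = len(grad_samples)
    mu_E = list(mu_T)
    for g in grad_samples:
        v = [mi * gi for mi, gi in zip(m, g)]                  # m ⊙ g_i
        norm = math.sqrt(sum(s * vi * vi                       # ||v||_Sigma
                             for s, vi in zip(Sigma_diag, v)))
        for d in range(len(mu_E)):
            mu_E[d] += math.sqrt(2.0 * alpha) / K * Sigma_diag[d] * v[d] / norm
    return mu_E
```

With K = 1 and unit covariance, the resulting shift has Σ-norm exactly √(2α); that is, the KL constraint is active, matching the derivation in Appendix B.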

4.2.1. FORMULATION OF Ẑ * AND Z π E

In this section, we formulate the epistemic and aleatoric uncertainty with the critic of the target policy, so that the distributions of Ẑ* and Z^{π_E} can be estimated. The remaining parts describe these two estimations in turn.
Formulation of Ẑ*. To formulate the distribution of the estimated optimal Q value, Ẑ*, we first estimate its upper confidence bound, denoted Q_UB. Following Clements et al. (2019), we adopt two independent distribution approximators Z(s, a; θ_1) and Z(s, a; θ_2) parameterized by θ_1 and θ_2, respectively. We then measure the epistemic uncertainty as

σ_epistemic(s, a; θ) = (1/2) E_{i∼U(1,N)} |z_i(s, a; θ_1) − z_i(s, a; θ_2)|,   (15)

where N is the number of quantiles and z_i(s, a; θ) is the value of the i-th quantile drawn from Z(s, a; θ). Consequently, the upper confidence bound of the Q value leverages σ_epistemic as

Q_UB(s, a; θ) = μ_Z(s, a; θ) + β σ_epistemic(s, a; θ),   (16)

where μ_Z(s, a; θ) = (1/N) Σ_{i=1}^N (1/2) Σ_{k=1,2} z_i(s, a; θ_k) is the mean over the quantile distributions and β determines the magnitude of uncertainty we use. Q_UB is commonly considered an approximation of the optimal Q value in existing work (Ciosek et al., 2019; Kirschner & Krause, 2018). Moreover, as shown in Dabney et al. (2018b), the aleatoric uncertainty can be captured by the return distribution, which we derive from the two quantile distributions as

σ²_aleatoric(s, a; θ) = var_{i∼U(1,N)}[E_{θ_k} z_i(s, a; θ_k)].   (17)

Inspired by Wang & Jegelka (2017), we adopt Q_UB and σ²_aleatoric as the mean and variance of a Gaussian and formulate Ẑ* as

Ẑ*(s, a; θ) ∼ N(Q_UB(s, a; θ), σ²_aleatoric(s, a; θ)) 1_{ẑ*≥Q_UB},   (18)

where Ẑ* follows a truncated Gaussian distribution, ensuring the global-optimality constraint E[Z*] = Q* ≥ Q_UB.
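Eqs. 15-17 translate directly into code; below is a minimal sketch over two lists of quantile values standing in for the outputs of the critics Z(s, a; θ_1) and Z(s, a; θ_2):

```python
def epistemic_uncertainty(z1, z2):
    # Eq. 15: half the mean absolute disagreement between the two critics.
    n = len(z1)
    return 0.5 * sum(abs(a - b) for a, b in zip(z1, z2)) / n

def aleatoric_variance(z1, z2):
    # Eq. 17: variance across quantiles of the per-quantile critic mean.
    n = len(z1)
    means = [(a + b) / 2.0 for a, b in zip(z1, z2)]
    mu = sum(means) / n
    return sum((m - mu) ** 2 for m in means) / n

def q_upper_bound(z1, z2, beta=1.6):
    # Eq. 16: mean quantile estimate plus beta times epistemic uncertainty.
    n = len(z1)
    mu_Z = sum((a + b) / 2.0 for a, b in zip(z1, z2)) / n
    return mu_Z + beta * epistemic_uncertainty(z1, z2)
```

The two quantities are deliberately decoupled: critic disagreement vanishes with more data (epistemic), while spread across quantiles persists for intrinsically noisy states (aleatoric).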
Moreover, since the distribution of the Q function describes the aleatoric uncertainty, we set the variance of Ẑ* to the aleatoric uncertainty obtained from Eq. 17.
Formulation of Z^{π_E}. Since the critic target in advanced algorithms like SAC and TD3 is usually estimated pessimistically, we take a pessimistic estimate for Z^{π_E} to make MQES compatible with existing modern RL algorithms. We present two modeling approaches: Gaussian and quantile distributions. First, we assume Z^{π_E} to be a Gaussian distribution whose mean is the pessimistic estimate:

Z^{π_E}(s, a; θ) ∼ N(Q_LB(s, a; θ), σ²_aleatoric(s, a; θ)),   (19)

where Q_LB(s, a; θ) = μ_Z(s, a; θ) − β σ_epistemic(s, a; θ) estimates the lower confidence bound. Alternatively, since the quantile distribution is a value distribution that naturally formulates the underlying aleatoric uncertainty, we can use the quantile functions to model a pessimistic quantile distribution directly, dropping the Gaussian assumption. Specifically, we take the smaller estimate at each quantile, so that Z^{π_E}(s, a; θ) is a uniform distribution over the quantile values

z^{π_E}_i(s, a; θ) = min_{k=1,2} z_i(s, a; θ_k).   (20)

Unlike the uni-modal Gaussian distribution, the quantile function can represent multimodal distributions, which is more flexible. Moreover, since the quantile function is the inverse of the CDF, it gives direct access to the properties of this pessimistic distribution.

Algorithm 1 MQES exploration policy
2: Calculate epistemic uncertainty σ_epistemic(s_t, a_t) according to Eq. 15
3: Calculate upper bound Q_UB(s_t, a_t) using σ_epistemic(s_t, a_t) according to Eq. 16
4: Calculate aleatoric uncertainty σ²_aleatoric(s_t, a_t) according to Eq. 17
5: Construct Ẑ*(s_t, a_t) using Q_UB(s_t, a_t) and σ²_aleatoric(s_t, a_t) (see Eq. 18)
6: Construct Z^{π_E}(s_t, a_t) according to Eq. 19 / 20
7: Calculate μ_E using Z^{π_E}(s_t, a_t) and Ẑ*(s_t, a_t) according to Eq. 14
8: return π_E ∼ N(μ_E, σ_T(s_t; φ))

Alg. 1 above summarizes the overall procedure of MQES, including the estimation of uncertainty (Lines 2, 4) and of the upper confidence bound Q_UB (Line 3), the formulation of Z^{π_E} and Ẑ* (Lines 5-6), and the generation of the exploration policy under the KL constraint (Line 7). The generated exploration policy can be adopted by any modern policy-based RL algorithm for more effective exploration.
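Putting the pieces together, a one-dimensional, Gaussian-Z^{π_E}, K = 1 sketch of Alg. 1 might look as follows; it approximates ẑ* by Q_UB and takes the gradient of ẑ* with respect to the action as a given input, both simplifying assumptions on our part:

```python
import math

def mqes_explore(mu_T, sigma_T, z1, z2, grad_zstar,
                 beta=1.6, alpha=0.1, C=1.0):
    """End-to-end sketch of Alg. 1 for a 1-D action.
    z1, z2: quantile estimates of the two critics at (s, mu_T);
    grad_zstar: assumed gradient of z*(s, a) wrt a at a = mu_T."""
    n = len(z1)
    # Eq. 15: epistemic uncertainty from critic disagreement.
    ep = 0.5 * sum(abs(a - b) for a, b in zip(z1, z2)) / n
    means = [(a + b) / 2.0 for a, b in zip(z1, z2)]
    mu_Z = sum(means) / n
    # Eq. 16 and its pessimistic mirror image.
    q_ub, q_lb = mu_Z + beta * ep, mu_Z - beta * ep
    # Eq. 17: aleatoric uncertainty from the return distribution.
    var_al = sum((m - mu_Z) ** 2 for m in means) / n
    sd = max(math.sqrt(var_al), 1e-8)
    # Eq. 10 / Prop. 2: posterior weight G and multiplier m (z* ~= Q_UB).
    G = 0.5 * (1.0 + math.erf((q_ub - q_lb) / (sd * math.sqrt(2.0)))) / C
    m = math.log(G) / math.sqrt(2.0 * math.pi * sigma_T ** 2) + 1.0
    # Eq. 14 (1-D, K = 1): shift the mean along the optimistic gradient.
    v = m * grad_zstar
    norm = math.sqrt(sigma_T ** 2 * v * v)
    mu_E = mu_T + math.sqrt(2.0 * alpha) * sigma_T ** 2 * v / max(norm, 1e-8)
    return mu_E, sigma_T
```

Even with no epistemic uncertainty (z1 == z2), a positive gradient shifts the exploration mean optimistically, with the shift size saturating at the KL budget √(2α)·σ_T.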

4.2.2. ANALYSIS OF MQES-BASED EXPLORATION

This section explains analytically how MQES encourages exploration while accounting for aleatoric and epistemic uncertainty. For simplicity, assume the sample number is K = 1, so that Eq. 14 degrades to

μ_E = μ_T + (√(2α) / ||m ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}||_{Σ_E}) Σ_E [m ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}],

where m_i = log G(s, μ_T) / √((2π)Σ_ii) + 1. Taking Gaussian MQES as an example, we have Z^{π_E}(s, a) ∼ N(Q_LB(s, a), σ²_aleatoric(s, a)), and the bias term added to μ_T is determined by both the epistemic and the aleatoric uncertainty. Specifically, since epistemic uncertainty is involved in Ẑ*(s, a), the gradient ∂ẑ*(s,a)/∂a encourages optimistic exploration; such epistemic uncertainty-based exploration avoids pessimistic under-exploration. The aleatoric uncertainty enters through the CDF Φ. Consider two state-action pairs (s_1, a_1) and (s_2, a_2) with Z^{π_E}(s_i, a_i) ∼ N(Q_LB(s_i, a_i), σ²_aleatoric(s_i, a_i)), i = 1, 2, and assume Q_LB(s_1, a_1) = Q_LB(s_2, a_2), σ²_1 > σ²_2, and ẑ*(s_1, a_1) = ẑ*(s_2, a_2). Then Φ_{Z^π(s_1,a_1)}(ẑ*(s_1, a_1)) < Φ_{Z^π(s_2,a_2)}(ẑ*(s_2, a_2)), which means that larger aleatoric uncertainty leads to a smaller action bias. Therefore, MQES encourages exploration by selecting actions that increase the optimistic value function, and avoids over-exploration by applying a smaller action bias at states where the aleatoric uncertainty is high.
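The effect of aleatoric uncertainty on the bias magnitude can be checked numerically; the sketch below assumes Gaussian Z^{π_E}, C = 1, and Σ_ii = 1, all illustrative choices of ours:

```python
import math

def gaussian_cdf(x, mu, sigma):
    # CDF of N(mu, sigma^2) via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bias_multiplier(z_star, q_lb, sigma_aleatoric, Sigma_ii=1.0, C=1.0):
    """m_i = log G(s, mu_T) / sqrt(2*pi*Sigma_ii) + 1, with
    G = Phi_{N(Q_LB, sigma^2)}(z*) / C."""
    G = gaussian_cdf(z_star, q_lb, sigma_aleatoric) / C
    return math.log(G) / math.sqrt(2.0 * math.pi * Sigma_ii) + 1.0

# Same Q_LB and z*, different aleatoric uncertainty: the noisier pair
# receives the smaller multiplier, hence the smaller action bias.
m_low_noise = bias_multiplier(z_star=1.0, q_lb=0.0, sigma_aleatoric=0.5)
m_high_noise = bias_multiplier(z_star=1.0, q_lb=0.0, sigma_aleatoric=2.0)
```

For ẑ* above Q_LB, a larger aleatoric standard deviation pulls Φ toward 0.5, making log G more negative and the multiplier smaller, which is exactly the over-exploration damping described above.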

5. EXPERIMENTS

MQES is designed for efficient exploration in continuous action space RL, making the agent aware of exploration directions that may lead to a higher optimistic value function with smaller aleatoric uncertainty. We compare MQES with state-of-the-art algorithms to verify its effectiveness and efficiency. Empirical evaluations show that MQES outperforms state-of-the-art algorithms on a series of continuous control tasks.

5.1. IMPLEMENTATION DETAILS AND EXPERIMENT SETTINGS

We compare MQES against SAC (Haarnoja et al., 2018) and its distributional variant DSAC (Ma et al., 2020). Ma et al. (2020) also report the performance of TD4, the distributional extension of TD3 (Fujimoto et al., 2018), which can also be used to capture epistemic and aleatoric uncertainty as pointed out in Sec. 4.2. However, since DSAC outperforms TD4 (Ma et al., 2020), we evaluate only against SAC and DSAC, and implement MQES on top of DSAC to develop its exploration ability. The training process of MQES is the same as in DSAC except for the behavior policy: we enrich the experience replay with data generated by π_E. The pseudo-code of the whole process can be found in Appendix C. To ensure a fair comparison, the hyper-parameters of DSAC and MQES are identical (see Appendix D). In addition, MQES has three hyper-parameters of its own: √(2α) controls the exploration level, β determines the magnitude of uncertainty we use, and C is the normalization factor. We implement both approaches for building Z^{π_E} as illustrated in Sec. 4.2.1, and use MQES_G and MQES_Q to denote the Gaussian and quantile variants, respectively. We test MQES on several Mujoco tasks (Todorov et al., 2012), including the standard versions as well as modified sparse and stochastic versions. We limit the maximum length of each episode to 100. We run 1250 or more epochs for each task, with 100 training steps per epoch, evaluating every epoch; each evaluation reports the average undiscounted return with no exploration noise.

5.2. STANDARD MUJOCO TASKS

We evaluate both MQES_G and MQES_Q on 5 standard Mujoco tasks; the results in Figure 1 and Appendix F show that our methods outperform SAC on all of these tasks and also reach better performance than DSAC.
Performance. Our results demonstrate that in complex tasks such as Humanoid-v2 and Ant-v2, our MQES-based exploration policy performs better, while DSAC suffers from the inefficiency caused by deficient exploration. In Ant-v2, DSAC is overtaken in the early stages of training, after which MQES stays ahead. In Humanoid-v2, our algorithm consistently maintains better performance than DSAC. Some relatively easier tasks seem less demanding for exploration, yet MQES still performs at a very advanced level. In Walker2d-v2, MQES and DSAC alternate the lead until 1000 epochs, after which MQES shows a significant improvement. The final results are shown in more detail in Table 3.
Gaussian and quantile Z^{π_E}. One can find that, except for Humanoid-v2, neither modeling approach is absolutely superior. We hypothesize this is because the Humanoid-v2 environment is more complex than the others, so the more flexible quantile Z^{π_E}, which can model the return distribution more accurately, is needed.

5.3. SPARSE MUJOCO TASKS

To further show the strength of our algorithm regarding exploration, we evaluate on sparse versions of the Mujoco tasks. Specifically, the reward is +1 when the move-distance threshold is reached, and 0 otherwise. Since we set the maximum episode length to 100, the maximum average episode return in the sparse tasks is 100. As shown in Figure 1 and Appendix F, SAC performs poorly on these tasks, which is to be expected, since solving sparse reward problems requires not only more accurate critic estimates but also more efficient exploration. Even in the sparse HumanoidStandup-v2 task, where our algorithm MQES reaches nearly maximum scores, SAC learns almost nothing and DSAC performs noticeably worse. As can be seen from Figure 1 and Table 3 together, both MQES variants perform best and achieve significantly better results faster than DSAC. MQES performs extremely well in these sparse environments, which on the one hand demonstrates the importance of exploration when solving sparse reward problems, and on the other hand shows the capability of MQES for efficient exploration, indicating that incorporating uncertainty into exploration yields better performance.

5.4. ABLATION STUDY

In this section, we conduct two ablation experiments to show the performance gain from distinguishing aleatoric and epistemic uncertainty, and the sensitivity to the hyper-parameters.

5.4.1. GAIN OF ALEATORIC UNCERTAINTY

Exploration with epistemic uncertainty has been shown to be efficient (Ciosek et al., 2019) by avoiding under-exploration. Here, we present the gain brought by the aleatoric uncertainty through an ablation study. Generally, the MQES-based actor-critic algorithm degrades to Optimistic Actor-Critic (OAC) (Ciosek et al., 2019) if m_i = 1 in Eq. 13. Consequently, we compare MQES with OAC to show the necessity of introducing aleatoric uncertainty into exploration. To show that MQES is robust in the face of aleatoric uncertainty, we modified the standard Ant-v2 task by adding heterogeneous noise to the observations to increase the aleatoric uncertainty of the environment. We also extend OAC to a distributional form (DOAC, Distributional OAC), i.e., it estimates values in the same way as MQES, to ensure a fair comparison. As shown in Figure 2, DOAC does not take aleatoric uncertainty into account, so its performance is close to that of DSAC. The experimental results confirm that MQES effectively avoids the interference of aleatoric uncertainty in the environment and ensures exploration efficiency.
Figure 2: Gain of aleatoric uncertainty, with the same plotting settings as Figure 1.

5.4.2. ABLATION STUDY ON HYPER-PARAMETERS

MQES is sensitive to √(2α), which controls the distance between the behavior policy π_E and the target policy π_T. If √(2α) is very small, MQES degenerates to DSAC and shows little exploration; if √(2α) is too large, performance degrades because of the gap between the behavior and target policies. We evaluate the sensitivity to the hyper-parameters on the Ant-v2 task using MQES_G; the sensitivity to the constraint √(2α) is shown in Figure 3, and the sensitivity to β is shown in Appendix F, where the error bar indicates half a standard deviation.
Figure 3: Ablation study on √(2α).

6. CONCLUSION

In this paper, we propose MQES, a general exploration principle for continuous RL algorithms, which formulates the exploration policy to maximize the information about the distribution of the globally optimal Q function. To make MQES practically tractable, we first incorporate distributional and ensemble Q function approximations into MQES, which formulate the epistemic and aleatoric uncertainty accordingly. Second, we introduce a constraint to stabilize training and solve the constrained MQES problem to derive the exploration policy in closed form. We then analyze the method and show that it explores optimistically and avoids over-exploration by recognizing the epistemic and aleatoric uncertainty, respectively. Empirical evaluations show that MQES works better in complex environments where exploration is needed.

A PROOF OF PROPOSITION 1

Proof. The proof is similar to Wang & Jegelka (2017). Here, we use Z^π to criticize π, i.e.,

p(a | z*(s, a*), s) = π(a|s) E_{z^π∼Z^π}[1_{z^π(s,a) ≤ z*(s,a*)}] / C,

where C is the normalization factor and E_{Z^π(s,a)}[1_{z^π(s,a) ≤ z*(s,a*)}] = Φ_{Z^π(s,a)}(z*(s, a*)).

B PROOF OF PROPOSITION 2

Proof. By the method of multipliers, it is equivalent to maximize the objective F̂^π(s, a) and minimize the constraint KL(π||π_T) simultaneously. First, we minimize KL(π||π_T) = KL(N(μ, Σ)||N(μ_T, Σ_T)) with respect to the covariance matrix Σ, i.e., min_Σ KL(N(μ, Σ)||N(μ_T, Σ_T)). (22) The optimal solution of problem (22) is Σ_E = Σ_T. Then F̂^π(s, a) is expanded linearly around a = μ_T:

F̂^π(s, a) ≈ a^T ∇_a F̂^π(s, a)|_{a=μ_T} + const = a^T [m̃ ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}] + const,   (24)

where the elements of the vector m̃ = {m̃_i}_{i=1}^n are

m̃_i = (1/C) φ_{Z^π(s,μ_T)}(ẑ*(s, μ_T)) ( log[Φ_{Z^π(s,μ_T)}(ẑ*(s, μ_T)) / C] √((2π)Σ_ii) + 1 ),

n is the action dimension, and φ(x) is the probability density function (pdf). Problem (12) is then reformulated as

max_μ E_{Ẑ*}[ μ^T (m̃ ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}) ]  s.t.  (1/2)(μ − μ_T)^T Σ^{-1}(μ − μ_T) ≤ α.   (27)

The Lagrangian of problem (27) is

L(μ) = E_{ẑ*}[ μ^T (m̃ ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}) ] − λ( (1/2)(μ − μ_T)^T Σ^{-1}(μ − μ_T) − α ).

According to the KKT conditions, we derive the following. First, from stationarity, ∇_μ L(μ) = E_{Ẑ*}[m̃ ⊙ ∂Ẑ*(s,a)/∂a |_{a=μ_T}] − λ Σ^{-1}(μ − μ_T) = 0, we get

μ = μ_T + (1/λ) Σ E_{Ẑ*}[m̃ ⊙ ∂ẑ*(s,a)/∂a |_{a=μ_T}],   (29)

where λ > 0. Second, since λ > 0, the constraint is active, i.e., (μ − μ_T)^T Σ^{-1}(μ − μ_T) = 2α. Together with Eq. 29, this gives

λ = ||E_{ẑ*}[m̃ ⊙ ∂Ẑ*(s,a)/∂a |_{a=μ_T}]||_Σ / √(2α).   (30)

Finally, Eq. 13 is obtained by plugging (30) into (29).

C ALGORITHM 2: MQES FOR DSAC

D HYPER-PARAMETERS SETTING

The hyper-parameters in our experiments are kept consistent across methods, as shown in Tab. 1.

E THRESHOLD SETTINGS FOR SPARSE TASKS

As illustrated in Sec. 5.3, we set the threshold according to a statistical analysis of untrained interaction behavior, using the 99.9% quantile value, as shown in Tab. 2.

F ADDITIONAL EXPERIMENT RESULTS

Limited by the length of the main text, we present the evaluation results on some simpler tasks in Fig. 5. All evaluation data can be seen in Tab. 3. The sensitivity to β is shown in Fig. 4. β controls the uncertainty magnitude: the smaller its value, the smaller the degree of optimism or pessimism, and vice versa. We evaluate the effect of β on the Ant-v2 task using MQES_Q. We observe that a larger β degrades performance. Although a smaller β is more profitable on Ant-v2, this does not necessarily hold for other tasks, so we use a uniform value of 1.6 in our experiments.






Figure 1: Training curves on continuous control benchmarks in Mujoco. The x-axis indicates number of training epoch (100 environment steps for each training epoch), while the y-axis is the evaluation result represented by average episode return. The shaded region denotes one standard deviation of average evaluation over 5 seeds. Curves are smoothed uniformly for visual clarity.

Figure 4: Ablation study on β

Figure 5: Training curves on continuous control benchmarks in Mujoco. The x-axis indicates number of training epoch (100 environment steps for each training epoch), while the y-axis is the evaluation result represented by average episode return. The shaded region denotes one standard deviation of average evaluation over 5 seeds. Curves are smoothed uniformly for visual clarity.

However, this only affects the choice of the hyper-parameter β and does not affect the final performance.

Algorithm 2 MQES for DSAC
Initialise: value networks θ_1, θ_2, policy network φ, and their target networks θ̄_1, θ̄_2, φ̄; number of quantiles N; target smoothing coefficient τ; discount γ; an empty replay pool D
1: for each iteration do ...

Table 2: Threshold settings for sparse reward tasks.

Table 3: Average return over 5 seeds with one standard deviation at the corresponding training step, i.e., 1.25 × 10^5 training steps for Hopper-v2. The maximum value of each row is shown in bold.

