DIVERSITY ACTOR-CRITIC: SAMPLE-AWARE ENTROPY REGULARIZATION FOR SAMPLE-EFFICIENT EXPLORATION

Anonymous

Abstract

Policy entropy regularization is commonly used for better exploration in deep reinforcement learning (RL). However, policy entropy regularization is sample-inefficient in off-policy learning since it does not take the distribution of previous samples stored in the replay buffer into account. In order to take advantage of the previous sample distribution from the replay buffer for sample-efficient exploration, we propose sample-aware entropy regularization, which maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer. We formulate the problem of sample-aware entropy regularized policy iteration, prove its convergence, and provide a practical algorithm named diversity actor-critic (DAC), which is a generalization of soft actor-critic (SAC). Numerical results show that DAC significantly outperforms SAC baselines and other state-of-the-art RL algorithms.

1. INTRODUCTION

Reinforcement learning (RL) aims to maximize the expectation of the discounted reward sum under Markov decision process (MDP) environments (Sutton & Barto, 1998). When the given task is complex, i.e., the environment has high action dimensions or sparse rewards, it is important to explore state-action pairs well for high performance (Agre & Rosenschein, 1996). For better exploration, recent RL considers various methods: maximizing the policy entropy to take actions more uniformly (Ziebart et al., 2008; Fox et al., 2015; Haarnoja et al., 2017), maximizing diversity gain that yields intrinsic rewards to explore rare states by counting the number of state visits (Strehl & Littman, 2008; Lopes et al., 2012), maximizing information gain (Houthooft et al., 2016; Hong et al., 2018), maximizing model prediction error (Achiam & Sastry, 2017; Pathak et al., 2017), and so on. In particular, based on policy iteration for soft Q-learning, (Haarnoja et al., 2018a) considered an off-policy actor-critic framework for maximum entropy RL and proposed the soft actor-critic (SAC) algorithm, which has competitive performance for challenging continuous control tasks. In this paper, we reconsider the problem of policy entropy regularization in off-policy learning and propose a generalized approach to policy entropy regularization. In off-policy learning, we store and reuse old samples to update the current policy (Mnih et al., 2015), and it is preferable that the old sample distribution in the replay buffer be close to uniform for better performance. However, simple policy entropy regularization tries to maximize the entropy of the current policy irrespective of the distribution of previous samples. Since the uniform distribution has maximum entropy, the current policy will choose previously less-sampled actions and more-sampled actions with the same probability; hence, simple policy entropy regularization is sample-unaware and sample-inefficient.
In order to overcome this drawback, we propose sample-aware entropy regularization, which tries to maximize the entropy of the weighted sum of the current policy action distribution and the sample action distribution from the replay buffer. We will show that the proposed sample-aware entropy regularization reduces to maximizing the sum of the policy entropy and the α-skewed Jensen-Shannon divergence (Nielsen, 2019) between the policy distribution and the buffer sample action distribution, and hence it generalizes SAC. We will also show that properly exploiting the sample action distribution in addition to the policy entropy over the learning phases yields far better performance.

2. RELATED WORKS

Entropy regularization: Entropy regularization maximizes the sum of the expected return and the policy action entropy. It encourages the agent to visit the action space uniformly for each given state, and the regularized policy is robust to modeling error (Ziebart, 2010). Entropy regularization is considered in various domains for better optimization: inverse reinforcement learning (Ziebart et al., 2008), stochastic optimal control problems (Todorov, 2008; Toussaint, 2009; Rawlik et al., 2013), and off-policy reinforcement learning (Fox et al., 2015; Haarnoja et al., 2017). (Lee et al., 2019) shows that Tsallis entropy regularization, which generalizes the usual Shannon entropy regularization, is helpful. (Nachum et al., 2017a) shows that there exists a connection between value-based and policy-based RL under entropy regularization. (O'Donoghue et al., 2016) proposed an algorithm combining them, and it is proven that they are equivalent (Schulman et al., 2017a). The entropy of the state mixture distribution is better for pure exploration than a simple random policy (Hazan et al., 2019).

Diversity gain: Diversity gain is used to provide guidance for exploration to the agent. To achieve diversity gain, many intrinsically-motivated approaches and intrinsic reward design methods have been considered, e.g., intrinsic reward based on curiosity (Chentanez et al., 2005; Baldassarre & Mirolli, 2013), model prediction error (Achiam & Sastry, 2017; Pathak et al., 2017; Burda et al., 2018), divergence/information gain (Houthooft et al., 2016; Hong et al., 2018), counting (Strehl & Littman, 2008; Lopes et al., 2012; Tang et al., 2017; Martin et al., 2017), and a unification of them (Bellemare et al., 2016). For self-imitation learning, (Gangwani et al., 2018) considered Stein variational gradient descent with the Jensen-Shannon kernel.
Off-policy learning: Off-policy learning can reuse any samples generated from behaviour policies for the policy update (Sutton & Barto, 1998; Degris et al., 2012) , so it is sample-efficient as compared to on-policy learning. In order to reuse old samples, a replay buffer that stores trajectories generated by previous policies is used for Q-learning (Mnih et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a) . To enhance both stability and sample-efficiency, several methods are considered, e.g., combining on-policy and off-policy (Wang et al., 2016; Gu et al., 2016; 2017) , and generalization from on-policy to off-policy (Nachum et al., 2017b; Han & Sung, 2019) . In order to guarantee the convergence of Q-learning, there is a key assumption: Each state-action pair must be visited infinitely often (Watkins & Dayan, 1992) . If the policy does not visit diverse state-action pairs many times, it converges to local optima. Therefore, exploration for visiting different state-action pairs is important for RL, and the original policy entropy regularization encourages exploration (Ahmed et al., 2019) . However, we found that the simple policy entropy regularization can be sample-inefficient in off-policy RL, so we aim to propose a new entropy regularization method that significantly enhances the sample-efficiency for exploration by considering the previous sample distribution in the buffer.

3. BACKGROUND

In this section, we briefly introduce the basic setup and the soft actor-critic (SAC) algorithm.

3.1. SETUP

We assume a basic RL setup composed of an environment and an agent. The environment follows an infinite-horizon Markov decision process $(\mathcal{S}, \mathcal{A}, P, \gamma, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability, $\gamma$ is the discount factor, and $r: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function. In this paper, we consider a continuous state-action space. The agent has a policy distribution $\pi: \mathcal{S}\times\mathcal{A}\to[0,\infty)$ which selects an action $a_t$ for a given state $s_t$ at each time step $t$, and the agent interacts with the environment and receives the reward $r_t := r(s_t, a_t)$ from the environment. Standard RL aims to maximize the discounted return $\mathbb{E}_{s_0\sim p_0,\,\tau_0\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$, where $\tau_t = (s_t, a_t, s_{t+1}, a_{t+1}, \cdots)$ is an episode trajectory.
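The discounted return above can be computed directly for any finite trajectory; a minimal sketch (the episode rewards and discount value are illustrative, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one finite episode trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# e.g. rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
```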

3.2. SOFT ACTOR-CRITIC

Soft actor-critic (SAC) (Haarnoja et al., 2018a) includes a policy entropy regularization term in the objective function for better exploration by visiting the action space uniformly for each given state. The entropy-augmented policy objective function of SAC is given by
$J_{SAC}(\pi) = \mathbb{E}_{\tau_0\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t + \beta\mathcal{H}(\pi(\cdot|s_t))\right)\right]$, (1)
where $\mathcal{H}$ is the entropy function and $\beta\in(0,\infty)$ is the entropy coefficient. SAC is a practical off-policy actor-critic algorithm based on soft policy iteration (SPI), which alternates soft policy evaluation to estimate the true soft Q-function and soft policy improvement to find the optimal policy that maximizes (1). In addition, SPI theoretically guarantees convergence to the optimal policy that maximizes (1).
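For a discrete action distribution, the per-step SAC term $r_t + \beta\mathcal{H}(\pi(\cdot|s_t))$ can be evaluated in closed form; a small sketch (the probabilities and β value are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(np.where(p > 0, p * np.log(np.clip(p, 1e-300, None)), 0.0)))

def sac_step_objective(r_t, pi_probs, beta=0.2):
    """Per-step SAC objective term: reward plus entropy bonus."""
    return r_t + beta * entropy(pi_probs)

# a uniform policy over two actions receives the maximal bonus beta * log(2)
```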

4.1. MOTIVATION OF THE SAMPLE-AWARE ENTROPY

As explained in Section 2, the policy should visit diverse samples so that learning does not converge to local optima. In off-policy learning, we can reuse previous samples stored in the replay buffer to learn the policy, so it is efficient to draw diverse samples while avoiding samples that were frequently selected before. Policy entropy maximization enhances exploration to yield better performance, but it is sample-inefficient for off-policy RL because it does not take advantage of the previous sample action distribution obtainable from the replay buffer: if we assume a bounded action space, simple policy entropy maximization will choose all actions with equal probability without considering the previous action samples, because $\max_\pi \mathcal{H}(\pi) = \min_\pi D_{KL}(\pi\|U)$ is achieved when $\pi = U$, where $U$ is the uniform distribution and $D_{KL}$ is the Kullback-Leibler (KL) divergence. In order to overcome the limitation of simple policy entropy maximization, we consider maximizing a sample-aware entropy defined as the entropy of a mixture distribution of the policy distribution $\pi$ and the current sample action distribution $q$ in the replay buffer. Here, $q$ is defined as
$q(\cdot|s) := \frac{\sum_{a\in\mathcal{D}} N(s,a)\,\delta_a(\cdot)}{\sum_{a'\in\mathcal{D}} N(s,a')}$, (2)
where $\mathcal{D}$ is the replay buffer that stores previous samples $(s_t, a_t, r_t, s_{t+1})$ at each time $t$, $\delta_a(\cdot)$ is the Dirac measure at $a\in\mathcal{A}$, and $N(s,a)$ is the number of occurrences of the state-action pair $(s,a)$ in $\mathcal{D}$. Then, we define a target distribution $q^{\pi,\alpha}_{target}$ as the mixture distribution of $\pi$ and $q$, expressed as $q^{\pi,\alpha}_{target} := \alpha\pi + (1-\alpha)q$, where $\alpha\in[0,1]$ is the weighting factor. Note that we draw samples from the policy $\pi$ and store them in the replay buffer, so the target distribution can be viewed as the updated sample action distribution in the future replay buffer.
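For a discrete action space, $q$ and the target mixture can be formed directly from buffer counts; a minimal sketch (the toy buffer contents are hypothetical):

```python
import numpy as np
from collections import Counter

def buffer_action_dist(buffer, state, n_actions):
    """q(.|s): normalized action counts N(s, a) over the replay buffer D."""
    counts = Counter(a for (s, a) in buffer if s == state)
    q = np.array([counts[a] for a in range(n_actions)], dtype=float)
    return q / q.sum()  # assumes the state appears at least once in the buffer

def target_dist(pi, q, alpha):
    """q_target^{pi,alpha} = alpha * pi + (1 - alpha) * q."""
    return alpha * np.asarray(pi) + (1 - alpha) * np.asarray(q)

# state 0 was visited with action 0 twice and action 1 once
buffer = [(0, 0), (0, 0), (0, 1)]
q = buffer_action_dist(buffer, state=0, n_actions=3)   # [2/3, 1/3, 0]
```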
Then, maximizing the sample-aware entropy $\mathcal{H}(q^{\pi,\alpha}_{target})$ can encourage sample-efficient exploration, because $\pi$ will choose actions rare in the buffer with high probability and actions stored many times in the buffer with low probability in order to make the target distribution uniform. We provide a simple example below. Consider a simple 1-step MDP in which $s_0$ is the unique initial state, there exist $N_a$ actions ($\mathcal{A} = \{A_0, \cdots, A_{N_a-1}\}$), $s_1$ is the terminal state, and $r$ is a deterministic reward function. Then, there exist $N_a$ state-action pairs in total, and let us assume that we already have $N_a - 1$ state-action samples in the replay buffer as $R = \{(s_0, A_0, r(s_0, A_0)), \cdots, (s_0, A_{N_a-2}, r(s_0, A_{N_a-2}))\}$. In order to estimate the Q-function for all state-action pairs, the policy should sample the last action $A_{N_a-1}$ (after that, we can reuse all samples infinitely to estimate Q). Here, we compare two exploration methods. 1) First, if we consider simple entropy maximization, the policy that maximizes its entropy will choose all actions with equal probability $1/N_a$ (uniformly). Then, $N_a$ samples should be taken on average by the policy to visit the action $A_{N_a-1}$. 2) Consider sample-aware entropy maximization. Here, the sample action distribution $q$ in the buffer becomes $q(a_0|s_0) = 1/(N_a-1)$ for $a_0\in\{A_0,\cdots,A_{N_a-2}\}$ and $q(A_{N_a-1}|s_0) = 0$, the target distribution becomes $q^{\pi,\alpha}_{target} = \alpha\pi + (1-\alpha)q$, and we set $\alpha = 1/N_a$. Then, the policy that maximizes the sample-aware entropy becomes $\pi(A_{N_a-1}|s_0) = 1$, which makes $q^{\pi,\alpha}_{target}$ uniform because $\max_\pi \mathcal{H}(q^{\pi,\alpha}_{target}) = \min_\pi D_{KL}(q^{\pi,\alpha}_{target}\|U)$. In this case, we only need one sample to visit the action $A_{N_a-1}$. In this way, simple entropy maximization is sample-inefficient for off-policy RL, whereas the proposed sample-aware entropy maximization can enhance the sample efficiency of exploration by using the previous sample distribution and choosing a proper $\alpha$.
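The 1-step example can be checked numerically: with $\alpha = 1/N_a$ and all of $\pi$'s mass on the missing action, the target mixture is exactly uniform. A small sketch ($N_a = 5$ is an arbitrary choice):

```python
import numpy as np

Na = 5
alpha = 1.0 / Na
# buffer distribution: N_a - 1 actions seen equally often, last action unseen
q = np.array([1.0 / (Na - 1)] * (Na - 1) + [0.0])
# policy that maximizes the sample-aware entropy: all mass on the missing action
pi = np.zeros(Na)
pi[-1] = 1.0
target = alpha * pi + (1 - alpha) * q
# each entry equals 1/Na: (1 - 1/Na)/(Na - 1) = 1/Na for seen actions, alpha = 1/Na for the last
```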
With this motivation, we propose sample-aware entropy regularization for off-policy RL and the corresponding α-adaptation method.

4.2. SAMPLE-AWARE ENTROPY REGULARIZATION

Our approach is to maximize the return while maximizing the sample-aware entropy. Under this approach, actions sampled many times before will be given low probabilities and actions taken less often before will be given high probabilities by the current policy $\pi$ for sample-efficient exploration, as shown in Section 4.1. Hence, we set the objective function for the proposed sample-aware entropy regularization as
$J(\pi) = \mathbb{E}_{\tau_0\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t + \beta\mathcal{H}(q^{\pi,\alpha}_{target}(\cdot|s_t))\right)\right]$. (3)
Here, the sample-aware entropy $\mathcal{H}(q^{\pi,\alpha}_{target})$ for given $s_t$ can be decomposed as
$\mathcal{H}(q^{\pi,\alpha}_{target}) = -\sum_{a\in\mathcal{A}}(\alpha\pi + (1-\alpha)q)\log(\alpha\pi + (1-\alpha)q) = \alpha\mathcal{H}(\pi) + D^\alpha_{JS}(\pi\|q) + (1-\alpha)\mathcal{H}(q)$, (4)
where $D^\alpha_{JS}(\pi\|q) := \alpha D_{KL}(\pi\|q^{\pi,\alpha}_{target}) + (1-\alpha)D_{KL}(q\|q^{\pi,\alpha}_{target})$ is the α-skewed Jensen-Shannon (JS) divergence (Nielsen, 2019). Note that $D^\alpha_{JS}$ reduces to the standard JS divergence for $\alpha = \frac{1}{2}$ and to zero for $\alpha = 0$ or $1$. Hence, for $\alpha = 1$, (4) reduces to the simple policy entropy, but for $\alpha \neq 1$, it is a generalization incorporating the distribution $q$. Thus, our objective function aims to maximize the return while simultaneously maximizing the discounted sum of the policy entropy, the sample entropy, and the divergence between $\pi$ and $q$. In this way, the policy will choose more diverse actions that are far from the samples stored in the replay buffer while maintaining its entropy for better exploration.
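The decomposition (4) is the standard identity $\mathcal{H}(\alpha\pi + (1-\alpha)q) = \alpha\mathcal{H}(\pi) + D^\alpha_{JS}(\pi\|q) + (1-\alpha)\mathcal{H}(q)$ and can be verified numerically on any pair of discrete distributions; a sketch with arbitrary example distributions:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(np.where(p > 0, p * np.log(np.clip(p, 1e-300, None)), 0.0)))

def KL(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(np.clip(p, 1e-300, None) / q), 0.0)))

pi = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
alpha = 0.3
mix = alpha * pi + (1 - alpha) * q                      # q_target^{pi,alpha}
d_js = alpha * KL(pi, mix) + (1 - alpha) * KL(q, mix)   # alpha-skewed JS divergence
# H(mix) == alpha*H(pi) + d_js + (1 - alpha)*H(q)
```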

4.3. DIVERSE POLICY ITERATION WITH THE PROPOSED OBJECTIVE

In this section, we derive the diverse policy evaluation and diverse policy improvement steps that maximize the objective function with the sample-aware entropy regularization (3). Note that the sample action distribution q is updated as the iteration goes on. However, it changes very slowly since the buffer size is much larger than the number of time steps in one iteration. Hence, for the purpose of proof, we regard the action distribution q as a fixed distribution in this section.

First, we define the true diverse Q-function $Q^\pi$ as
$Q^\pi(s_t, a_t) := \frac{1}{\beta}r_t + \mathbb{E}_{\tau_{t+1}\sim\pi}\left[\sum_{l=t+1}^{\infty}\gamma^{l-t-1}\left(\frac{1}{\beta}r_l + \alpha\mathcal{H}(\pi(\cdot|s_l)) + D^\alpha_{JS}(\pi(\cdot|s_l)\|q(\cdot|s_l)) + (1-\alpha)\mathcal{H}(q(\cdot|s_l))\right)\right]$.
We defined the sample distribution $q$ in equation (2), but we do not want to compute the actual $q$, which requires a method such as discretization and counting for continuous samples. Even if $q$ is obtained by counting, a generalization of $q$ for arbitrary state-action pairs is needed again to estimate $Q^\pi$. We circumvent this difficulty by defining the ratio $R^{\pi,\alpha}$ of $\alpha\pi$ to $q^{\pi,\alpha}_{target}$ as
$R^{\pi,\alpha}(s_t, a_t) := \frac{\alpha\pi(a_t|s_t)}{\alpha\pi(a_t|s_t) + (1-\alpha)q(a_t|s_t)}$,
and we will show in Appendix B that all objective (or loss) functions for practical implementation can be represented by using the ratio only, without using the explicit $q$. Then, we can decompose $D^\alpha_{JS}(\pi(\cdot|s_l)\|q(\cdot|s_l))$ as
$D^\alpha_{JS}(\pi\|q) = \alpha\mathbb{E}_{a_l\sim\pi(\cdot|s_l)}[\log R^{\pi,\alpha}(s_l, a_l)] + (1-\alpha)\mathbb{E}_{a_l\sim q(\cdot|s_l)}[\log(1 - R^{\pi,\alpha}(s_l, a_l))] + \mathcal{H}(\alpha)$,
where $\mathcal{H}(\alpha) = -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)$ is the binary entropy function. The modified Bellman backup operator for $Q^\pi$ estimation is given by
$T^\pi Q(s_t, a_t) := \frac{1}{\beta}r_t + \gamma\mathbb{E}_{s_{t+1}\sim P}[V(s_{t+1})]$, (7)
where $V(s_t) = \mathbb{E}_{a_t\sim\pi}[Q(s_t, a_t) + \alpha\log R^{\pi,\alpha}(s_t, a_t) - \alpha\log\alpha\pi(a_t|s_t)] + (1-\alpha)\mathbb{E}_{a_t\sim q}[\log(1 - R^{\pi,\alpha}(s_t, a_t)) - \log(1-\alpha)q(a_t|s_t)]$ is an estimated diverse state value function and $Q: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is an estimated diverse state-action value function. Proof of the convergence of diverse policy evaluation, which estimates $Q^\pi$ by repeating the Bellman operator (7), is provided in Appendix A. Then, the policy is updated from $\pi_{old}$ to $\pi_{new}$ as $\pi_{new} = \arg\max_\pi J_{\pi_{old}}(\pi)$, where $J_{\pi_{old}}(\pi)$ is the objective of $\pi$ estimated under $Q^{\pi_{old}}$, defined as
$J_{\pi_{old}}(\pi(\cdot|s_t)) := \beta\{\mathbb{E}_{a_t\sim\pi}[Q^{\pi_{old}}(s_t, a_t) + \alpha\log R^{\pi,\alpha}(s_t, a_t) - \alpha\log\alpha\pi(a_t|s_t)] + (1-\alpha)\mathbb{E}_{a_t\sim q}[\log(1 - R^{\pi,\alpha}(s_t, a_t)) - \log(1-\alpha)q(a_t|s_t)]\}$. (8)
The monotone improvement of this step is proved in Appendix A.
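The ratio-based expression of $D^\alpha_{JS}$ can be sanity-checked against its direct definition for discrete distributions; a sketch with arbitrary example distributions:

```python
import numpy as np

pi = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.2, 0.6])
alpha = 0.4
mix = alpha * pi + (1 - alpha) * q
R = alpha * pi / mix                    # ratio R^{pi,alpha} of alpha*pi to the mixture

def KL(p, m):
    return float(np.sum(p * np.log(p / m)))

H_alpha = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)  # binary entropy
# direct definition of the alpha-skewed JS divergence
direct = alpha * KL(pi, mix) + (1 - alpha) * KL(q, mix)
# ratio-based form: alpha*E_pi[log R] + (1-alpha)*E_q[log(1-R)] + H(alpha)
via_ratio = (alpha * np.sum(pi * np.log(R))
             + (1 - alpha) * np.sum(q * np.log(1 - R))
             + H_alpha)
```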
Now, we can find the optimal policy that maximizes $J(\pi)$ ($= J_\pi(\pi)$) by the following theorem.

Theorem 1 (Diverse Policy Iteration) By repeated iteration of the diverse policy evaluation and the diverse policy improvement, any initial policy converges to the optimal policy $\pi^*$ s.t. $Q^{\pi^*}(s_t, a_t) \ge Q^\pi(s_t, a_t)$, $\forall\pi\in\Pi$, $\forall(s_t, a_t)\in\mathcal{S}\times\mathcal{A}$. Also, such $\pi^*$ achieves maximum $J$, i.e., $J_{\pi^*}(\pi^*) \ge J_\pi(\pi)$ for any $\pi\in\Pi$.

Proof. See Appendix A.1.

Note that $J_{\pi_{old}}(\pi)$ for diverse policy iteration above requires the ratio function $R^{\pi,\alpha}$ of the current policy $\pi$, but we can only estimate $R^{\pi_{old},\alpha}$ for the previous policy $\pi_{old}$ in practice. Thus, we circumvent this difficulty by defining a practical objective function $\tilde{J}_{\pi_{old}}(\pi)$ given by
$\tilde{J}_{\pi_{old}}(\pi(\cdot|s_t)) := \beta\mathbb{E}_{a_t\sim\pi}[Q^{\pi_{old}}(s_t, a_t) + \alpha\log R^{\pi_{old},\alpha}(s_t, a_t) - \alpha\log\pi(a_t|s_t)]$.
Regarding the practically computable objective function $\tilde{J}_{\pi_{old}}(\pi)$, we have the following result.

Theorem 2 Suppose that the policy is parameterized with parameter $\theta$. For the parameterized policy $\pi_\theta$, the two objective functions $J_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ and $\tilde{J}_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ have the same gradient direction for $\theta$ at $\theta = \theta_{old}$ for all $s_t\in\mathcal{S}$.

Proof. See Appendix A.2.

By Theorem 2, we can replace the objective function $J_{\pi_{old}}(\pi)$ of policy improvement with the practically computable objective function $\tilde{J}_{\pi_{old}}(\pi)$ for the parameterized policy without loss of optimality.
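Theorem 2 can be illustrated numerically in a tabular case: with a softmax policy, $J_{\pi_{old}}$ simplifies for one state to $\beta(\mathbb{E}_\pi[Q^{\pi_{old}}] + \mathcal{H}(q^{\pi,\alpha}_{target}))$, and its finite-difference gradient matches that of $\tilde{J}_{\pi_{old}}$ at $\theta = \theta_{old}$. A sketch with arbitrary toy values for Q, q, α, and β:

```python
import numpy as np

def softmax(th):
    e = np.exp(th - th.max())
    return e / e.sum()

beta, alpha = 0.2, 0.6
Q = np.array([1.0, 0.5, -0.3])          # toy Q^{pi_old} values for one state
q = np.array([0.5, 0.3, 0.2])           # toy buffer action distribution
th_old = np.array([0.1, -0.2, 0.3])     # parameters of pi_old

def J(th):
    # full objective: beta * (E_pi[Q] + H(alpha*pi + (1-alpha)*q)); ratio follows the current pi
    pi = softmax(th)
    mix = alpha * pi + (1 - alpha) * q
    return beta * (pi @ Q - mix @ np.log(mix))

def J_tilde(th):
    # practical objective: the ratio is frozen at pi_old
    pi = softmax(th)
    pi_old = softmax(th_old)
    R_old = alpha * pi_old / (alpha * pi_old + (1 - alpha) * q)
    return beta * (pi @ (Q + alpha * np.log(R_old) - alpha * np.log(pi)))

def num_grad(f, th, eps=1e-6):
    g = np.zeros_like(th)
    for i in range(len(th)):
        d = np.zeros_like(th)
        d[i] = eps
        g[i] = (f(th + d) - f(th - d)) / (2 * eps)
    return g

# the two gradients coincide at theta = theta_old
```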

4.4. DIVERSITY ACTOR CRITIC IMPLEMENTATION

We first define $R_\alpha$ as an estimate of the ratio function $R^{\pi_{old},\alpha}$. For implementation, we parameterize $\pi$, $R_\alpha$, $Q$, and $V$ by neural network parameters $\theta$, $\eta$, $\phi$, and $\psi$, respectively. Then, we set up the practical objective (or loss) functions $\hat{J}_\pi(\theta)$, $\hat{J}_{R_\alpha}(\eta)$, $\hat{L}_Q(\phi)$, and $\hat{L}_V(\psi)$ for the parameter update. Detailed DAC implementation based on Section 4 is provided in Appendix B, and the proposed DAC algorithm is summarized in Appendix C. Note that DAC becomes SAC when $\alpha = 1$ and becomes standard off-policy RL without entropy regularization when $\alpha = 0$.
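The ratio estimate behaves like a weighted binary discriminator between policy samples and buffer samples: the objective $\alpha\mathbb{E}_\pi[\log R] + (1-\alpha)\mathbb{E}_q[\log(1-R)]$ is maximized exactly at $R = \alpha\pi/(\alpha\pi + (1-\alpha)q)$. A tabular sketch using expected gradients instead of a neural network (all distributions are toy values; DAC's actual loss $\hat{J}_{R_\alpha}(\eta)$ is in Appendix B):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha = 0.5
pi = np.array([0.8, 0.2])   # current policy over two actions (toy)
q = np.array([0.3, 0.7])    # buffer action distribution (toy)

logits = np.zeros(2)        # tabular stand-in for R_eta, one logit per action
lr = 0.5
for _ in range(2000):
    R = sigmoid(logits)
    # expected gradient of alpha*E_pi[log R] + (1-alpha)*E_q[log(1-R)] w.r.t. the logits
    grad = alpha * pi * (1 - R) - (1 - alpha) * q * R
    logits += lr * grad

R_hat = sigmoid(logits)
R_true = alpha * pi / (alpha * pi + (1 - alpha) * q)   # the target ratio
```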

5. α-ADAPTATION

In the proposed sample-aware entropy regularization, the weighting factor α plays an important role in controlling the ratio between the policy distribution π and the sample action distribution q. However, it is difficult to estimate the optimal α directly. Hence, we further propose an adaptation method for α based on the max-min principle widely considered in game theory, robust learning, and decision making problems (Chinchuluun et al., 2008). Since we do not know the optimal α, an alternative formulation is to maximize the return while maximizing the worst-case sample-aware entropy, i.e., $\min_\alpha \mathcal{H}(q^{\pi,\alpha}_{target})$. Then, the max-min approach can be formulated as follows:
$\max_\pi \mathbb{E}_{\tau_0\sim\pi}\left[\sum_t \gamma^t\left(r_t + \beta\min_\alpha\left[\mathcal{H}(q^{\pi,\alpha}_{target}) - \alpha c\right]\right)\right]$, (10)
where $c$ is a control hyperparameter for α adaptation. We learn α to minimize $\mathcal{H}(q^{\pi,\alpha}_{target}) - \alpha c$, so the role of $c$ is to maintain the target entropy at a certain level to explore the state-action space well. Detailed implementation of α-adaptation is given in Appendix B.1.
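The inner minimization over α can be sketched in a tabular case as gradient descent on $L(\alpha) = \mathcal{H}(q^{\pi,\alpha}_{target}) - \alpha c$ with a finite-difference gradient (the distributions, step size, and c value are illustrative; DAC's actual update is in Appendix B.1):

```python
import numpy as np

def H_mix(pi, q, alpha):
    """Sample-aware entropy H(alpha*pi + (1-alpha)*q) for discrete distributions."""
    mix = alpha * np.asarray(pi) + (1 - alpha) * np.asarray(q)
    return float(-np.sum(mix * np.log(mix)))

def alpha_step(alpha, pi, q, c, lr=0.05, eps=1e-5):
    """One gradient-descent step on L(alpha) = H(q_target^{pi,alpha}) - alpha*c."""
    grad = (H_mix(pi, q, alpha + eps) - H_mix(pi, q, alpha - eps)) / (2 * eps) - c
    return float(np.clip(alpha - lr * grad, 1e-3, 1 - 1e-3))

pi = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])
c = 0.1
alpha = 0.5
for _ in range(200):
    alpha = alpha_step(alpha, pi, q, c)
# after the descent steps, L(alpha) is no larger than at the starting point alpha = 0.5
```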

6. EXPERIMENTS

In this section, we evaluate the proposed DAC algorithm on various continuous-action control tasks and provide an ablation study. In order to see the superiority of the sample-aware entropy regularization, we here focus on comparison with two SAC baselines: SAC and SAC-Div. SAC-Div is SAC combined with the method in (Hong et al., 2018) that diversifies policies from the buffer distribution by simply maximizing $J(\pi) + \alpha_d D(\pi\|q)$ for $J(\pi)$ in (1) and some divergence $D$. Note that the key difference between SAC-Div and DAC is that SAC-Div simply adds a single divergence term to the policy objective function $J(\pi)$, whereas DAC considers the discounted sum of target entropy terms as seen in (3). For SAC-Div, we consider the KL divergence (MSE if the policy is Gaussian) and the adaptive scale $\alpha_d$ with $\delta_d = 0.2$ for the divergence term, as suggested in (Hong et al., 2018). In order to rule out the influence of factors other than exploration, we use a common simulation setup for DAC and the SAC baselines except for the parts about entropy or divergence. In addition, in Appendix F.2 we provide a comparison of DAC to random network distillation (RND) (Burda et al., 2018) and MaxEnt (Hazan et al., 2019), which are recent exploration methods based on finding rare states, and in Appendix F.3 a comparison to other recent RL algorithms. The results show that DAC yields the best performance for all considered tasks as compared to recent RL algorithms. We also provide the source code of our DAC implementation, which requires Python and TensorFlow. The detailed simulation setup for the experiments is summarized in Appendix E.

6.2. PERFORMANCE COMPARISON WITH THE SAC BASELINES

The final goal of RL is to achieve high scores for given tasks. For this, exploration techniques are needed to ensure that the policy does not converge to local optima, as explained in Section 2. We first showed the improvement of exploration performance in a pure exploration task (continuous 4-room maze), and the experiments in this section will show that DAC has better return performance than the SAC baselines on several sparse-reward tasks. Note that achieving high scores on sparse-reward tasks means that the policy can get rewards well without falling into local optima, which implies that the agent successfully explores more state-action pairs that have positive (or diverse) rewards. Therefore, the performance comparison on sparse tasks fits the motivation well; also note that sparse-reward tasks have been widely used as a verification method of exploration in many previous exploration studies (Hong et al., 2018; Mazoure et al., 2019; Burda et al., 2018).

Figure 3: α-skewed JS divergence for DAC and SAC/SAC-Div

Fixed α case: In order to see the advantage of the sample-aware entropy regularization for rewarded tasks, we compare the performance of DAC with α = 0.5 and the SAC baselines on simple MDP tasks: SparseMujoco tasks. SparseMujoco is a sparse version of Mujoco, and the reward is 1 if the agent exceeds the x-axis threshold and 0 otherwise (Hong et al., 2018; Mazoure et al., 2019). The performance results averaged over 10 random seeds are shown in Fig. 2. As seen in Fig. 2, DAC has significant performance gain for most tasks as compared to SAC. On the other hand, SAC-Div also enhances the convergence speed compared to SAC for some tasks, but it fails to enhance the final performance. Fig. 3 shows the α-skewed JS divergence curves (α = 0.5) of DAC and SAC/SAC-Div for sparse Mujoco tasks, and Fig. F.1 in Appendix F.1 shows the corresponding mean number of discretized state visitations on sparse Mujoco tasks.
For SAC/SAC-Div, the ratio function R is estimated separately from (B.2) in Appendix B, and the divergence is computed from R. The performance table for all tasks is given in Table F.1 in Appendix F.1. As seen in Fig. 3, the divergence of DAC is much higher than that of SAC/SAC-Div throughout the learning time. This means that the policy of DAC chooses more diverse actions from a distribution far away from the sample action distribution q, so DAC visits more diverse states than the SAC baselines, as seen in Fig. F.1. Thus, DAC encourages better exploration, which yields better performance. We can therefore conclude that the proposed sample-aware entropy regularization is superior to the simple policy entropy regularization of SAC and the single divergence regularization of SAC-Div in terms of exploration and convergence.

Adaptive α case: Now, we compare the performance of DAC with α = 0.5, α = 0.8, α-adaptation, and the SAC baselines to see the need for α-adaptation. To maintain controllability and prevent saturation, we used regularization for α learning and restricted the range of α as 0.5 ≤ α ≤ 0.99 for α-adaptation so that a certain level of entropy regularization is enforced. Here, we consider more complicated tasks: HumanoidStandup and delayed Mujoco tasks (DelayedHalfCheetah, DelayedHopper, DelayedWalker2d, and DelayedAnt). HumanoidStandup is one of the high-action-dimensional Mujoco tasks. The delayed Mujoco tasks suggested by (Zheng et al., 2018; Guo et al., 2018) have the same state-action spaces as the original Mujoco tasks, but the reward is sparsified. That is, rewards for D time steps are accumulated and the accumulated sum is delivered to the agent once every D time steps, so the agent receives no reward during the accumulation time. The performance results averaged over 5 random seeds are shown in Fig. 4. The maximum average return of these Mujoco tasks for DAC and SAC/SAC-Div is provided in Table F.2 in Appendix F.1. As seen in Fig. 4, all versions of DAC outperform SAC. Here, SAC-Div also outperforms SAC for several tasks, but the performance gain by DAC is much higher. In addition, it is seen that the best α depends on the task in the fixed α case. For example, α = 0.8 is the best for DelayedHalfCheetah, but α = 0.5 is the best for DelayedAnt. Thus, we need to adapt α for each task. Finally, DAC with α-adaptation has top-level performance for most tasks and the best performance for the HumanoidStandup and DelayedHopper tasks. Further consideration of α is provided in Section 6.3.
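The reward sparsification used for the delayed Mujoco tasks can be sketched as a gym-style wrapper that accumulates rewards for D steps and releases the sum (class and method names are hypothetical, following the old 4-tuple gym step API):

```python
class DelayedRewardWrapper:
    """Accumulate rewards for D steps and deliver the sum once every D steps."""

    def __init__(self, env, D=20):
        self.env = env
        self.D = D
        self._acc = 0.0
        self._count = 0

    def reset(self):
        self._acc, self._count = 0.0, 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._acc += reward
        self._count += 1
        delayed = 0.0
        if self._count % self.D == 0 or done:   # release on every D-th step or at episode end
            delayed, self._acc = self._acc, 0.0
        return obs, delayed, done, info
```

The total episode reward is preserved; only its timing changes, which makes the credit assignment sparser.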

6.3. ABLATION STUDY

In this section, we provide an ablation study of the important parameters in the sample-aware entropy regularization on the DelayedHalfCheetah task. Ablation studies on the other delayed Mujoco tasks are provided in Appendix G.

Weighting factor α: As seen in Section 6.2, α-adaptation is necessary because one particular value of α is not the best for all environments. Although the proposed α-adaptation in Section 5 is suboptimal, it shows good performance across all the considered tasks. Thus, we study the proposed α-adaptation and the sensible behavior of sample-awareness in entropy regularization further. Fig. 5(a) shows the averaged learning curves of α, the α-skewed JS divergence $D^\alpha_{JS}(\pi\|q)$, and the policy entropy $\mathcal{H}(\pi)$ for DAC with the proposed α-adaptation method on DelayedHalfCheetah. Here, we fix the control coefficient c as $-2.0\,\dim(\mathcal{A})$. As seen in (3), the return, the policy entropy, and the JS divergence are intertwined in the cost function, so their learning curves are also intertwined over time steps. Here, the learned policy entropy term decreases and the learned α increases to one as the time step goes on. Then, the initially nonzero JS divergence term $D^\alpha_{JS}(\pi\|q)$ diminishes to zero, which means that the sample action distribution is exploited for roughly the initial 2.5M time steps, and then DAC operates like SAC. This adaptive exploitation of the sample-aware entropy leads to better overall performance across time steps, as seen in Fig. 4, so DAC with α-adaptation seems to properly exploit both the policy entropy and the sample action distribution depending on the learning stage.

Control coefficient c: In the proposed α-adaptation (10), the control coefficient c affects the learning behavior of α. Since $\mathcal{H}(\pi)$ and $D^\alpha_{JS}$ are proportional to the action dimension, we tried a few values such as $0$, $-0.5d$, $-1.0d$ and $-2.0d$, where $d = \dim(\mathcal{A})$. Fig. 5(b) shows the corresponding performance of DAC with α-adaptation on DelayedHalfCheetah. As seen in Fig. 5(b), the performance depends on the value of c as expected, and $c = -2.0\,\dim(\mathcal{A})$ performs well. We observed that $-2.0d$ performed well for all considered tasks, so we set $c = -2.0d$ in (B.8).

Entropy coefficient β: As mentioned in (Haarnoja et al., 2018a), the performance of SAC depends on β, so it is expected that the performance of DAC depends on β too. Fig. 5(c) shows the performance of DAC with fixed α = 0.5 for three different values of β: 0.1, 0.2 and 0.4 on DelayedHalfCheetah. It is seen that the performance of DAC indeed depends on β. Although there exists a performance difference for DAC depending on β, the performance of DAC is much better than SAC for a wide range of β. One thing to note is that the coefficient of the pure policy entropy regularization term for DAC is αβ, as seen in (3). Thus, DAC with α = 0.5 and β = 0.4 has the same amount of pure policy entropy regularization as SAC with β = 0.2. However, DAC with α = 0.5 and β = 0.4 has much higher performance than SAC with β = 0.2, as seen in Fig. 5(c). So, we can see that the performance improvement of DAC comes from the joint use of the policy entropy $\mathcal{H}(\pi)$ and the sample action distribution from the replay buffer via $D^\alpha_{JS}(\pi\|q)$.

The effect of the JS divergence: In order to see the effect of the JS divergence on the performance, we also provide an additional ablation study in which we consider a single JS divergence for SAC-Div by using the ratio function in Section 4.3. Fig. 5(d) shows the performance comparison of SAC, SAC-Div(KL), SAC-Div(JS), and DAC. For SAC-Div(JS), we used $\delta_d = 0.5$ for the adaptive scaling in (Hong et al., 2018). As a result, there was no significant difference in performance between SAC-Div with the JS divergence and SAC-Div with the KL divergence. On the other hand, DAC still shows a greater performance increase than both SAC-Div(KL) and SAC-Div(JS), which means that DAC has more advantages than simply using the JS divergence.

7. CONCLUSION AND FUTURE WORKS

In this paper, we have proposed a sample-aware entropy framework for off-policy RL to overcome the limitation of the simple policy entropy for sample-efficient exploration. With the sample-aware entropy regularization, we can achieve diversity gain by exploiting the sample history in the replay buffer in addition to the policy entropy. For a practical implementation of sample-aware entropy regularized policy optimization, we have proposed the DAC algorithm with a convergence proof. We have also provided an adaptation method for DAC to control the ratio of the sample action distribution to the policy action entropy. DAC is an actor-critic algorithm for sample-aware regularized policy optimization and generalizes SAC. Numerical results show that DAC significantly outperforms the SAC baselines in maze exploration and various Mujoco tasks. For further study, we consider a generalization of our method to deal with the entropy of the state-action distribution. Currently, many recent papers consider only one of the entropy of the state distribution $d^\pi(s)$ or that of the action distribution $\pi(a|s)$, since they have quite different properties (e.g., the state-based entropy is non-convex in π whereas the action-based entropy is convex in π). However, both entropies can be handled simultaneously as one fused entropy that deals with the entropy of the state-action distribution, factorized as $\log d^\pi(s, a) = \log d^\pi(s) + \log\pi(a|s)$. Then, the generalization of our method to the fused entropy may be able to further enhance the exploration performance by considering exploration over the entire state-action space.

A PROOFS

A.1 PROOF OF THEOREM 1

For a fixed policy π, $Q^\pi$ can be estimated by repeating the Bellman backup operator by Lemma 1. Lemma 1 is based on the usual policy evaluation but has the new ingredient of the ratio condition in the sample-aware case.

Lemma 1 (Diverse Policy Evaluation) Define a sequence of diverse Q-functions as $Q_{k+1} = T^\pi Q_k$, $k \ge 0$, where $\pi$ is a fixed policy and $Q_0$ is a real-valued initial Q. Assume that the action space is bounded and $R^{\pi,\alpha}(s_t, a_t) \in (0,1)$ for all $(s_t, a_t)\in\mathcal{S}\times\mathcal{A}$. Then, the sequence $\{Q_k\}$ converges to the true diverse state-action value $Q^\pi$.

Proof. Let
$r_{\pi,t} := \frac{1}{\beta}r_t + \gamma\mathbb{E}_{s_{t+1}\sim P}\big[\mathbb{E}_{a_{t+1}\sim\pi}[\alpha\log R^{\pi,\alpha}(s_{t+1}, a_{t+1}) - \alpha\log\alpha\pi(a_{t+1}|s_{t+1})] + (1-\alpha)\mathbb{E}_{a_{t+1}\sim q}[\log(1 - R^{\pi,\alpha}(s_{t+1}, a_{t+1})) - \log(1-\alpha)q(a_{t+1}|s_{t+1})]\big]$.
Then, we can formulate the standard Bellman equation form for the true $Q^\pi$ as
$T^\pi Q(s_t, a_t) = r_{\pi,t} + \gamma\mathbb{E}_{s_{t+1}\sim P,\ a_{t+1}\sim\pi}[Q(s_{t+1}, a_{t+1})]$. (A.1)
Under the assumption of a bounded action space and $R^{\pi,\alpha}\in(0,1)$, the reward $r_{\pi,t}$ is bounded, and convergence is guaranteed as in the usual policy evaluation (Sutton & Barto, 1998; Haarnoja et al., 2018a).

Now, we prove diverse policy improvement in Lemma 2 and diverse policy iteration in Theorem 1 by using $J_{\pi_{old}}(\pi)$ in a similar way to usual RL or SAC.

Lemma 2 (Diverse Policy Improvement) Let $\pi_{new}$ be the updated policy obtained by solving $\pi_{new} = \arg\max_{\pi\in\Pi} J_{\pi_{old}}(\pi)$. Then, $Q^{\pi_{new}}(s_t, a_t) \ge Q^{\pi_{old}}(s_t, a_t)$, $\forall(s_t, a_t)\in\mathcal{S}\times\mathcal{A}$.

Proof. We update the policy to maximize $J_{\pi_{old}}(\pi)$, so $J_{\pi_{old}}(\pi_{new}) \ge J_{\pi_{old}}(\pi_{old})$.
Hence,
$$\mathbb{E}_{a_t\sim\pi_{new}}\big[Q^{\pi_{old}}(s_t,a_t) + \alpha\log R_{\pi_{new},\alpha}(s_t,a_t) - \alpha\log\alpha\pi_{new}(a_t|s_t)\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\log(1-R_{\pi_{new},\alpha}(s_t,a_t)) - \log(1-\alpha)q(a_t|s_t)\big]$$
$$\ge \mathbb{E}_{a_t\sim\pi_{old}}\big[Q^{\pi_{old}}(s_t,a_t) + \alpha\log R_{\pi_{old},\alpha}(s_t,a_t) - \alpha\log\alpha\pi_{old}(a_t|s_t)\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\log(1-R_{\pi_{old},\alpha}(s_t,a_t)) - \log(1-\alpha)q(a_t|s_t)\big] = V^{\pi_{old}}(s_t). \quad\text{(A.2)}$$
By repeatedly applying the Bellman equation (7) together with (A.2) to $Q^{\pi_{old}}$,
$$Q^{\pi_{old}}(s_t,a_t) = \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\big[V^{\pi_{old}}(s_{t+1})\big]$$
$$\le \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\Big[\mathbb{E}_{a_{t+1}\sim\pi_{new}}\big[Q^{\pi_{old}}(s_{t+1},a_{t+1}) + \alpha\log R_{\pi_{new},\alpha}(s_{t+1},a_{t+1}) - \alpha\log\alpha\pi_{new}(a_{t+1}|s_{t+1})\big] + (1-\alpha)\mathbb{E}_{a_{t+1}\sim q}\big[\log(1-R_{\pi_{new},\alpha}(s_{t+1},a_{t+1})) - \log(1-\alpha)q(a_{t+1}|s_{t+1})\big]\Big]$$
$$\le \cdots \le Q^{\pi_{new}}(s_t,a_t) \quad\text{(A.3)}$$
for each $(s_t,a_t)\in\mathcal{S}\times\mathcal{A}$.

Theorem 1 (Diverse Policy Iteration) By repeating the diverse policy evaluation and the diverse policy improvement, any initial policy converges to the optimal policy $\pi^*$ such that $Q^{\pi^*}(s_t,a_t) \ge Q^{\pi}(s_t,a_t)$ for all $\pi\in\Pi$ and all $(s_t,a_t)\in\mathcal{S}\times\mathcal{A}$. Moreover, such $\pi^*$ achieves maximum $J$, i.e., $J_{\pi^*}(\pi^*) \ge J_{\pi}(\pi)$ for any $\pi\in\Pi$.

Proof. Let $\{\pi_i : i \ge 0,\ \pi_i\in\Pi\}$ be a sequence of policies such that $\pi_{i+1} = \arg\max_{\pi\in\Pi} J_{\pi_i}(\pi)$. For arbitrary state-action pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$, $\{Q^{\pi_i}(s,a)\}$ monotonically increases by Lemma 2, and each $Q^{\pi_i}(s,a)$ is bounded. Also, $\pi_{i+1}$ is obtained by the policy improvement that maximizes $J_{\pi_i}(\pi(\cdot|s))$, so $J_{\pi_i}(\pi_{i+1}(\cdot|s)) \ge J_{\pi_i}(\pi_i(\cdot|s))$ as stated in the proof of Lemma 2. From the definition of $J_{\pi_{old}}(\pi)$ in (8), all terms are the same for $J_{\pi_{i+1}}(\pi_{i+1}(\cdot|s))$ and $J_{\pi_i}(\pi_{i+1}(\cdot|s))$ except $\beta\,\mathbb{E}_{a\sim\pi_{i+1}}[Q^{\pi_{i+1}}(s,a)]$ in the former and $\beta\,\mathbb{E}_{a\sim\pi_{i+1}}[Q^{\pi_i}(s,a)]$ in the latter. Since $\{Q^{\pi_i}(s,a)\}$ monotonically increases, $J_{\pi_{i+1}}(\pi_{i+1}(\cdot|s)) \ge J_{\pi_i}(\pi_{i+1}(\cdot|s))$.
Finally, $J_{\pi_{i+1}}(\pi_{i+1}(\cdot|s)) \ge J_{\pi_i}(\pi_{i+1}(\cdot|s)) \ge J_{\pi_i}(\pi_i(\cdot|s))$ for any state $s\in\mathcal{S}$, so the sequence $\{J_{\pi_i}(\pi_i(\cdot|s))\}$ also monotonically increases, and each $J_{\pi_i}(\pi_i(\cdot|s))$ is bounded because the Q-function and the target entropy are bounded. By the monotone convergence theorem, $\{Q^{\pi_i}\}$ and $\{J_{\pi_i}(\pi_i)\}$ converge pointwise to their optimal functions $Q^* : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ and $J^* : \mathcal{S}\to\mathbb{R}$, respectively. Here, note that $J^*(s) \ge J_{\pi_i}(\pi_i(\cdot|s))$ for any $i$ because the sequence $\{J_{\pi_i}(\pi_i)\}$ is monotonically increasing. From the definition of a convergent sequence, for arbitrary $\epsilon > 0$ there is a large $N \ge 0$ such that $J_{\pi_i}(\pi_i(\cdot|s)) \ge J^*(s) - \tfrac{\epsilon(1-\gamma)}{\gamma}$ for all $i \ge N$ and any $s\in\mathcal{S}$.

Now, we can easily show that

$J_{\pi_k}(\pi_k(\cdot|s)) \ge J_{\pi_k}(\pi(\cdot|s)) - \tfrac{\epsilon(1-\gamma)}{\gamma}$ for any $k > N$, any policy $\pi\in\Pi$, and any $s\in\mathcal{S}$. (If not, $J_{\pi_k}(\pi_{k+1}) = \max_{\pi'} J_{\pi_k}(\pi') \ge J_{\pi_k}(\pi)$, and then $J_{\pi_{k+1}}(\pi_{k+1}(\cdot|s')) \ge J_{\pi_k}(\pi_{k+1}(\cdot|s')) \ge J_{\pi_k}(\pi(\cdot|s')) > J_{\pi_k}(\pi_k(\cdot|s')) + \tfrac{\epsilon(1-\gamma)}{\gamma} \ge J^*(s')$ for some $s'\in\mathcal{S}$. Clearly, this contradicts the monotone increase of the sequence $\{J_{\pi_i}(\pi_i)\}$.) Then, in a similar way to (A.3),
$$Q^{\pi_k}(s_t,a_t) = \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\big[V^{\pi_k}(s_{t+1})\big] = \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\big[J_{\pi_k}(\pi_k(\cdot|s_{t+1}))\big]$$
$$\ge \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\Big[J_{\pi_k}(\pi(\cdot|s_{t+1})) - \tfrac{\epsilon(1-\gamma)}{\gamma}\Big]$$
$$= \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\Big[\mathbb{E}_{a_{t+1}\sim\pi}\big[Q^{\pi_k}(s_{t+1},a_{t+1}) + \alpha\log R_{\pi,\alpha}(s_{t+1},a_{t+1}) - \alpha\log\alpha\pi(a_{t+1}|s_{t+1})\big] + (1-\alpha)\mathbb{E}_{a_{t+1}\sim q}\big[\log(1-R_{\pi,\alpha}(s_{t+1},a_{t+1})) - \log(1-\alpha)q(a_{t+1}|s_{t+1})\big]\Big] - \epsilon(1-\gamma)$$
$$\ge \cdots \ge Q^{\pi}(s_t,a_t) - \epsilon. \quad\text{(A.4)}$$
Note that the state-action pair $(s,a)$, the policy $\pi$, and $\epsilon > 0$ were arbitrary, so we conclude that $Q^{\pi_\infty}(s,a) \ge Q^{\pi}(s,a)$ for any $\pi\in\Pi$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$. In addition, we showed that $J_{\pi_k}(\pi_k(\cdot|s)) \ge J_{\pi_k}(\pi(\cdot|s)) - \tfrac{\epsilon(1-\gamma)}{\gamma}$, so $J_{\pi_\infty}(\pi_\infty(\cdot|s)) \ge J_{\pi}(\pi(\cdot|s))$ for any $\pi\in\Pi$ and any $s\in\mathcal{S}$. Thus, $\pi_\infty$ is the optimal policy $\pi^*$, and we conclude that $\{\pi_i\}$ converges to the optimal policy $\pi^*$.
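The monotone improvement underlying Lemma 2 and Theorem 1 can be illustrated numerically. The sketch below builds a hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions, not from the paper's experiments), runs diverse policy evaluation by iterating the Bellman backup of Lemma 1, performs one diverse policy improvement step over a per-state grid of action distributions that contains the initial policy, and checks that the diverse Q-function does not decrease:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, alpha, beta, gamma = 2, 2, 0.5, 1.0, 0.9

q = np.array([[0.5, 0.5], [0.2, 0.8]])           # buffer action distribution q(a|s)
r = rng.standard_normal((nS, nA))                # extrinsic reward r(s,a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # transition kernel P(s'|s,a)

def evaluate(pi, iters=800):
    """Diverse policy evaluation (Lemma 1): iterate Q <- T^pi Q to a fixed point."""
    R = alpha * pi / (alpha * pi + (1 - alpha) * q)          # ratio R_{pi,alpha}
    bonus = (alpha * (pi * (np.log(R) - np.log(alpha * pi))).sum(1)
             + (1 - alpha) * (q * (np.log(1 - R) - np.log((1 - alpha) * q))).sum(1))
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = r / beta + gamma * P @ ((pi * Q).sum(1) + bonus)  # diverse Bellman backup
    return Q

def J_state(pi_s, Q_old_s, q_s):
    """J_{pi_old}(pi(.|s)) for one state (beta = 1; constants in alpha dropped)."""
    R = alpha * pi_s / (alpha * pi_s + (1 - alpha) * q_s)
    return (np.sum(pi_s * (Q_old_s + alpha * np.log(R) - alpha * np.log(alpha * pi_s)))
            + (1 - alpha) * np.sum(q_s * (np.log(1 - R) - np.log((1 - alpha) * q_s))))

pi_old = np.full((nS, nA), 0.5)                  # uniform initial policy
Q_old = evaluate(pi_old)

# improvement step: per state, argmax of J_{pi_old} over a grid containing pi_old
grid = np.linspace(0.01, 0.99, 99)
cands = np.stack([grid, 1.0 - grid], axis=1)     # candidate distributions over 2 actions
pi_new = np.array([cands[np.argmax([J_state(c, Q_old[s], q[s]) for c in cands])]
                   for s in range(nS)])

Q_new = evaluate(pi_new)
assert np.all(Q_new >= Q_old - 1e-8)             # Lemma 2: Q^{pi_new} >= Q^{pi_old}
```

Since the per-state grid contains the uniform policy, the improvement step can only increase $J_{\pi_{old}}$ at every state, which is exactly the hypothesis of Lemma 2.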

A.2 PROOF OF THEOREM 2

Theorem 2 Suppose that the policy is parameterized with parameter $\theta$. Then, for the parameterized policy $\pi_\theta$, the two objective functions $J_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ and $\hat{J}_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ have the same gradient direction for $\theta$ at $\theta = \theta_{old}$ for all $s_t\in\mathcal{S}$.

Proof. Under the parameterization by $\theta$, the two objective functions become
$$J_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t)) = \beta\big(\mathbb{E}_{a_t\sim\pi_\theta}\big[Q^{\pi_{\theta_{old}}}(s_t,a_t) + \alpha\log R_{\pi_\theta,\alpha}(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\log(1-R_{\pi_\theta,\alpha}(s_t,a_t)) - \log q(a_t|s_t)\big]\big) + \mathcal{H}(\alpha),$$
$$\hat{J}_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t)) = \beta\,\mathbb{E}_{a_t\sim\pi_\theta}\big[Q^{\pi_{\theta_{old}}}(s_t,a_t) + \alpha\log R_{\pi_{\theta_{old}},\alpha}(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big].$$
We can ignore the common Q-function and $\log\pi_\theta$ terms, as well as the terms constant with respect to $\theta$, since they yield identical or zero gradients in both objective functions. Thus, we only need to show that
$$\nabla_\theta\big[\alpha\mathbb{E}_{a_t\sim\pi_\theta}[\log R_{\pi_\theta,\alpha}] + (1-\alpha)\mathbb{E}_{a_t\sim q}[\log(1-R_{\pi_\theta,\alpha})]\big] = \nabla_\theta\,\mathbb{E}_{a_t\sim\pi_\theta}[\alpha\log R_{\pi_{\theta_{old}},\alpha}] \quad\text{(A.5)}$$
at $\theta = \theta_{old}$. Now, the gradient of the left-hand side of (A.5) at $\theta = \theta_{old}$ can be expressed as
$$\nabla_\theta\big[\alpha\mathbb{E}_{a_t\sim\pi_\theta}[\log R_{\pi_\theta,\alpha}] + (1-\alpha)\mathbb{E}_{a_t\sim q}[\log(1-R_{\pi_\theta,\alpha})]\big]$$
$$= \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\log R_{\pi_\theta,\alpha}\cdot\nabla_\theta\log\pi_\theta\big] + \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta\log R_{\pi_\theta,\alpha}\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\nabla_\theta\log(1-R_{\pi_\theta,\alpha})\big]$$
$$= \nabla_\theta\,\mathbb{E}_{a_t\sim\pi_\theta}[\alpha\log R_{\pi_{\theta_{old}},\alpha}] + \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta\log R_{\pi_\theta,\alpha}\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\nabla_\theta\log(1-R_{\pi_\theta,\alpha})\big]. \quad\text{(A.6)}$$
Here, the sum of the last two terms in (A.6) is zero, as shown below:
$$\alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta\log R_{\pi_\theta,\alpha}\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\nabla_\theta\log(1-R_{\pi_\theta,\alpha})\big]$$
$$= \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] + (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\nabla_\theta(1-R_{\pi_\theta,\alpha})/(1-R_{\pi_\theta,\alpha})\big]$$
$$= \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] - (1-\alpha)\mathbb{E}_{a_t\sim q}\big[\nabla_\theta R_{\pi_\theta,\alpha}/(1-R_{\pi_\theta,\alpha})\big]$$
$$= \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] - (1-\alpha)\mathbb{E}_{a_t\sim q}\Big[\frac{\alpha\pi_\theta+(1-\alpha)q}{(1-\alpha)q}\cdot\nabla_\theta R_{\pi_\theta,\alpha}\Big]$$
$$\stackrel{(1)}{=} \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] - \alpha\mathbb{E}_{a_t\sim\pi_\theta}\Big[\frac{\alpha\pi_\theta+(1-\alpha)q}{\alpha\pi_\theta}\cdot\nabla_\theta R_{\pi_\theta,\alpha}\Big]$$
$$= \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] - \alpha\mathbb{E}_{a_t\sim\pi_\theta}\big[\nabla_\theta R_{\pi_\theta,\alpha}/R_{\pi_\theta,\alpha}\big] = 0, \quad\text{(A.7)}$$
where Step (1) uses the importance sampling identity $\mathbb{E}_{a_t\sim q}[f(s_t,a_t)] = \mathbb{E}_{a_t\sim\pi_\theta}\big[\tfrac{q(a_t|s_t)}{\pi_\theta(a_t|s_t)}f(s_t,a_t)\big]$. By (A.6) and (A.7), $J_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ and $\hat{J}_{\pi_{\theta_{old}}}(\pi_\theta(\cdot|s_t))$ have the same gradient at $\theta = \theta_{old}$.

B DETAILED DAC IMPLEMENTATION

To compute the final objective function (9), we need to estimate $Q^{\pi_{old}}$ and $R_{\pi_{old},\alpha}$. $Q^{\pi_{old}}$ can be estimated by diverse policy evaluation. For estimation of $R_{\pi_{old},\alpha}$, we use a function $R^\alpha$ with the objective
$$J(R^\alpha(s_t,\cdot)) = \alpha\,\mathbb{E}_{a_t\sim\pi}[\log R^\alpha(s_t,a_t)] + (1-\alpha)\,\mathbb{E}_{a_t\sim q}[\log(1-R^\alpha(s_t,a_t))].$$
In the $\alpha = 0.5$ case, the Generative Adversarial Network (GAN) (Goodfellow et al., 2014) framework shows that the ratio function can be estimated by maximizing $J(R^{0.5})$. In a similar way, we can easily show that maximizing $J(R^\alpha)$ estimates our ratio function as follows. For given $s$,
$$J(R^\alpha(s,\cdot)) = \int_{\mathcal{A}}\big[\alpha\pi(a|s)\log R^\alpha(s,a) + (1-\alpha)q(a|s)\log(1-R^\alpha(s,a))\big]\,da.$$
The integrand has the form $y \mapsto a\log y + b\log(1-y)$ with $a = \alpha\pi$ and $b = (1-\alpha)q$. For any $(a,b)$ with $a, b > 0$, this function attains its maximum at $y = a/(a+b)$. Thus, the optimal $R^{*,\alpha}$ maximizing $J(R^\alpha(s,\cdot))$ is $R^{*,\alpha}(s,a) = \alpha\pi/(\alpha\pi+(1-\alpha)q) = R_{\pi,\alpha}(s,a)$. Note that $J(R^\alpha)$ becomes an $\alpha$-skewed Jensen-Shannon (JS) divergence, up to constant terms, when $R^\alpha = R_{\pi,\alpha}$.

For implementation, we use deep neural networks to approximate the policy $\pi$, the diverse value functions $Q$ and $V$, and the ratio function $R^\alpha$; their network parameters are $\theta$, $\phi$, $\psi$, and $\eta$, respectively. Based on Section 4.3, we provide the practical objective (or loss) functions for the parameter updates: $\hat{J}_\pi(\theta)$, $\hat{J}_{R^\alpha}(\eta)$, $\hat{L}_Q(\phi)$, and $\hat{L}_V(\psi)$. The objective functions for the policy $\pi$ and the ratio function $R^\alpha$ are respectively given by
$$\hat{J}_\pi(\theta) = \mathbb{E}_{s_t\sim\mathcal{D},\,a_t\sim\pi_\theta}\big[Q_\phi(s_t,a_t) + \alpha\log R^\alpha_\eta(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big], \quad\text{(B.1)}$$
$$\hat{J}_{R^\alpha}(\eta) = \mathbb{E}_{s_t\sim\mathcal{D}}\big[\alpha\,\mathbb{E}_{a_t\sim\pi_\theta}[\log R^\alpha_\eta(s_t,a_t)] + (1-\alpha)\,\mathbb{E}_{a_t\sim\mathcal{D}}[\log(1-R^\alpha_\eta(s_t,a_t))]\big]. \quad\text{(B.2)}$$
Furthermore, based on the Bellman operator, the loss functions for the value functions $Q$ and $V$ are respectively given by
$$\hat{L}_Q(\phi) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\phi(s_t,a_t) - \bar{Q}(s_t,a_t)\big)^2\Big], \quad\text{(B.3)}$$
$$\hat{L}_V(\psi) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \bar{V}(s_t)\big)^2\Big], \quad\text{(B.4)}$$
where the target values are defined as
$$\bar{Q}(s_t,a_t) = \tfrac{1}{\beta}r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim P}\big[V_{\bar\psi}(s_{t+1})\big], \quad\text{(B.5)}$$
$$\bar{V}(s_t) = \mathbb{E}_{a_t\sim\pi_\theta}\big[Q_\phi(s_t,a_t) + \alpha\log R^\alpha_\eta(s_t,a_t) - \alpha\log\alpha\pi_\theta(a_t|s_t)\big] + (1-\alpha)\,\mathbb{E}_{a_t\sim\mathcal{D}}\big[\log(1-R^\alpha_\eta(s_t,a_t)) - \log(1-\alpha)q(a_t|s_t)\big]. \quad\text{(B.6)}$$
By the property of the ratio function, $\log(1-R_{\pi,\alpha}) - \log(1-\alpha)q = -\log(\alpha\pi+(1-\alpha)q) = \log R_{\pi,\alpha} - \log\alpha\pi$, so we can replace the last term in $\bar{V}(s_t)$ with $(1-\alpha)\,\mathbb{E}_{a_t\sim\mathcal{D}}[\log R^\alpha_\eta(s_t,a_t) - \log\alpha\pi(a_t|s_t)]$. However, the probability of $\pi$ for actions sampled from $\mathcal{D}$ can have high variance, so for stable learning we clip the term inside the expectation over $a_t\sim\mathcal{D}$ by the action dimension, and the final target value becomes
$$\bar{V}(s_t) = \mathbb{E}_{a_t\sim\pi_\theta}\big[Q_\phi(s_t,a_t) + \alpha\log R^\alpha_\eta(s_t,a_t) - \alpha\log\alpha\pi_\theta(a_t|s_t)\big] + (1-\alpha)\,\mathbb{E}_{a_t\sim\mathcal{D}}\big[\mathrm{clip}\big(\log R^\alpha_\eta(s_t,a_t) - \log\alpha\pi(a_t|s_t),\,-d,\,d\big)\big], \quad\text{(B.7)}$$
where $d = \dim(\mathcal{A})$ is the action dimension. We use (B.7) for implementation. Note that none of the objective (or loss) functions requires the explicit $q$; all of them can be represented using only the ratio function $R^\alpha$, as explained in Section 4.3. In addition, $R^\alpha\in(0,1)$ must hold for the proof of Theorem 1, and this is satisfied when $\pi$ and $q$ are non-zero for all state-action pairs. For practical implementation, we clip the ratio function to $(\epsilon, 1-\epsilon)$ for small $\epsilon > 0$, since some $q$ values can be close to zero before the replay buffer stores a sufficient number of samples; $\pi$ is always non-zero since we consider a Gaussian policy. Here, $\bar\psi$ is the parameter of the target value network $V_{\bar\psi}$, updated by an exponential moving average (EMA) of $\psi$ for stable learning (Mnih et al., 2015).
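The one-dimensional fact used in the ratio-function argument above, that $y \mapsto a\log y + b\log(1-y)$ peaks at $y = a/(a+b)$, is easy to verify numerically; the values of $a$ and $b$ below are arbitrary stand-ins for $\alpha\pi$ and $(1-\alpha)q$:

```python
import numpy as np

# a, b play the roles of alpha*pi(a|s) and (1-alpha)*q(a|s); numbers are arbitrary
a, b = 0.35, 0.15
y = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
f = a * np.log(y) + b * np.log(1.0 - y)   # integrand of J(R^alpha) at one action
y_star = y[np.argmax(f)]
assert abs(y_star - a / (a + b)) < 1e-3   # maximum at a/(a+b) = 0.7
```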
Combining all of the above, we propose the diversity actor-critic (DAC) algorithm, summarized as Algorithm 1 in Appendix C. Note that DAC reduces to SAC when $\alpha = 1$ and to standard off-policy RL without entropy regularization when $\alpha = 0$. To compute the gradient of $\hat{J}_\pi(\theta)$, we use the reparameterization trick (Kingma & Welling, 2013; Haarnoja et al., 2018a). The policy action $a_t\sim\pi_\theta$ is the output of the policy network with parameter $\theta$, so it can be viewed as $a_t = f_\theta(\epsilon_t; s_t)$, where $f$ is a function parameterized by $\theta$ and $\epsilon_t$ is a noise vector sampled from the spherical normal distribution $\mathcal{N}$. Then, the gradient of $\hat{J}_\pi(\theta)$ is represented as
$$\nabla_\theta\hat{J}_\pi(\theta) = \mathbb{E}_{s_t\sim\mathcal{D},\,\epsilon_t\sim\mathcal{N}}\big[\nabla_a\big(Q_\phi(s_t,a) + \alpha\log R^\alpha_\eta(s_t,a) - \alpha\log\pi_\theta(a|s_t)\big)\big|_{a=f_\theta(\epsilon_t;s_t)}\nabla_\theta f_\theta(\epsilon_t;s_t) - \alpha(\nabla_\theta\log\pi_\theta)(f_\theta(\epsilon_t;s_t)|s_t)\big].$$
For implementation, we use two Q-functions $Q_{\phi_i}$, $i = 1, 2$, to reduce overestimation bias as proposed in (Fujimoto et al., 2018); each Q-function is updated to minimize its own loss $\hat{L}_Q(\phi_i)$, and the minimum of the two Q-functions is used for the policy and value function updates (Haarnoja et al., 2018a). Note that one version of SAC (Haarnoja et al., 2018b) adapts the entropy control factor $\beta$ by the Lagrangian method with the constraint $\mathcal{H}(\pi) \ge c$. This approach can also be generalized to our case, but it is beyond the scope of the current paper, and we consider only fixed $\beta$ here.
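As a minimal numpy sketch of the reparameterization trick described above, with a stand-in scalar objective $f(a) = -a^2$ in place of the actual Q and ratio terms so that the true gradient is known in closed form:

```python
import numpy as np

# Reparameterization: a_t = f_theta(eps; s_t) = mu + sigma * eps, eps ~ N(0, 1).
# For f(a) = -a^2, E[f(mu + sigma*eps)] = -(mu^2 + sigma^2), so d/dmu = -2*mu.
rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.3
eps = rng.standard_normal(200_000)   # noise samples eps_t from the spherical normal
a = mu + sigma * eps                 # reparameterized actions
grad_est = np.mean(-2.0 * a)         # pathwise estimate: f'(a) * (d a / d mu)
assert abs(grad_est - (-2.0 * mu)) < 0.01
```

The same pathwise construction is what allows the gradient of $\hat{J}_\pi(\theta)$ to flow through the sampled action into $Q_\phi$ and $R^\alpha_\eta$.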

B.1 DETAILED IMPLEMENTATION OF THE α-ADAPTATION

In order to learn $\alpha$, we parameterize $\alpha$ as a function of $s_t$ with parameter $\xi$, i.e., $\alpha = \alpha_\xi(s_t)$, and implement $\alpha_\xi(s_t)$ with a neural network. Then, $\xi$ is updated to minimize the following loss function deduced from (10):
$$\hat{L}_\alpha(\xi) = \mathbb{E}_{s_t\sim\mathcal{D}}\big[\alpha_\xi\mathcal{H}(\pi_\theta) + D_{JS}^{\alpha_\xi}(\pi_\theta\|q) + (1-\alpha_\xi)\mathcal{H}(q) - \alpha_\xi c\big]. \quad\text{(B.8)}$$
Here, all the updates in the diverse policy iteration remain the same except that $\alpha$ is replaced with $\alpha_\xi(s_t)$. The gradient of $\hat{L}_\alpha(\xi)$ with respect to $\xi$ can be computed as
$$\nabla_\xi\hat{L}_\alpha(\xi) = \nabla_\xi\,\mathbb{E}_{s_t\sim\mathcal{D}}\big[\alpha_\xi\mathcal{H}(\pi_\theta) + D_{JS}^{\alpha_\xi}(\pi_\theta\|q) + (1-\alpha_\xi)\mathcal{H}(q) - \alpha_\xi c\big]$$
$$= \nabla_\xi\,\mathbb{E}_{s_t\sim\mathcal{D}}\big[\alpha_\xi\,\mathbb{E}_{a_t\sim\pi_\theta}[-\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q) - c] + (1-\alpha_\xi)\,\mathbb{E}_{a_t\sim q}[-\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)]\big]$$
$$= \mathbb{E}_{s_t\sim\mathcal{D}}\big[(\nabla_\xi\alpha_\xi)\big(\mathbb{E}_{a_t\sim\pi_\theta}[-\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q) - c] - \mathbb{E}_{a_t\sim q}[-\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)]\big)\big]$$
$$\quad + \mathbb{E}_{s_t\sim\mathcal{D}}\big[\alpha_\xi\,\mathbb{E}_{a_t\sim\pi_\theta}[-\nabla_\xi\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)] + (1-\alpha_\xi)\,\mathbb{E}_{a_t\sim q}[-\nabla_\xi\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)]\big]$$
$$= \mathbb{E}_{s_t\sim\mathcal{D}}\big[(\nabla_\xi\alpha_\xi)\big(\mathbb{E}_{a_t\sim\pi_\theta}[-\log\alpha_\xi\pi_\theta + \log R_{\pi_\theta,\alpha_\xi} - c] - \mathbb{E}_{a_t\sim q}[\log(1-R_{\pi_\theta,\alpha_\xi}) - \log(1-\alpha_\xi)q]\big)\big]$$
$$\quad + \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\underbrace{\textstyle\int_{\mathcal{A}}(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)\big[-\nabla_\xi\log(\alpha_\xi\pi_\theta+(1-\alpha_\xi)q)\big]\,da_t}_{=\,0}\Big]$$
$$= \mathbb{E}_{s_t\sim\mathcal{D}}\big[(\nabla_\xi\alpha_\xi)\big(\mathbb{E}_{a_t\sim\pi_\theta}[-\log\alpha_\xi\pi_\theta + \log R_{\pi_\theta,\alpha_\xi} - c] - \mathbb{E}_{a_t\sim q}[\log R_{\pi_\theta,\alpha_\xi} - \log\alpha_\xi\pi_\theta]\big)\big]. \quad\text{(B.9)}$$
Note that $R_{\pi_\theta,\alpha_\xi}$ can be estimated by the ratio function $R^\alpha_\eta$.
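The cancellation of the second expectation in (B.9) rests on the identity $\mathbb{E}_{a\sim p}[\nabla\log p(a)] = 0$ for any parameterized distribution $p$. A quick numeric check for a discrete softmax distribution with arbitrary logits:

```python
import numpy as np

z = np.array([0.3, -1.2, 0.7])            # arbitrary logits parameterizing p
p = np.exp(z) / np.exp(z).sum()           # p = softmax(z)
# For a softmax, d log p_a / d z_j = 1{a=j} - p_j, so row a of the matrix below
# is the gradient of log p_a with respect to the logits z.
grad_log_p = np.eye(3) - p
assert np.allclose(p @ grad_log_p, 0.0)   # E_{a~p}[grad_z log p(a)] = 0
```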

E SIMULATION SETUP

We compared our DAC algorithm with the SAC baselines and other RL algorithms on various Mujoco tasks with continuous action spaces (Todorov et al., 2012) in OpenAI GYM (Brockman et al., 2016). For fairness, both SAC/SAC-Div and DAC used a common hyperparameter setup that basically follows the setup in (Haarnoja et al., 2018a). The detailed hyperparameter setup and environment descriptions are provided in Appendix D, and the entropy coefficient β is selected based on the ablation study in Section 6.3. For the policy space Π, we considered the set of Gaussian policies widely used in continuous RL. For the performance plots in this section, we used deterministic evaluation, which generates an episode with the deterministic policy at each evaluation iteration, and the shaded region in each figure represents one standard deviation (1σ) from the mean.

F PERFORMANCE COMPARISONS

In this section, we provide more performance plots and tables. We first compared the pure exploration performance of DAC to random network distillation (RND) (Burda et al., 2018) and MaxEnt (Hazan et al., 2019), which are state-of-the-art exploration methods, on the continuous 4-room maze task described in Section 6.1. RND adds an intrinsic reward $r_{int,t}$ to the MDP extrinsic reward $r_t$ as $r_{RND,t} = r_t + c_{int}r_{int,t}$, based on the model prediction error $r_{int,t} = \|\hat{f}(s_{t+1}) - \bar{f}(s_{t+1})\|^2$ between a prediction network $\hat{f}$ and a random target network $\bar{f}$ for given state $s_{t+1}$. The parameters of the target network are randomly initialized and fixed, and the prediction network learns to minimize the MSE between the two models; the agent is thus driven toward rare states, since rare states have higher prediction errors. On the other hand, MaxEnt maximizes the entropy of the state mixture distribution $d^{\pi_{mix}}$ by setting the reward functional in (Hazan et al., 2019) to $-\log d^{\pi_{mix}}(s) + c_M$, where $d^\pi$ is the state distribution of the trajectory generated from $\pi$ and $c_M$ is a smoothing constant. MaxEnt mainly targets large or continuous state spaces, so the reward functional is computed with several projection/discretization methods, and MaxEnt then explores the state space better than a simple random policy on various continuous-state tasks. For RND, both the prediction network and the target network are MLPs with 2 ReLU hidden layers of size 256, where the input dimension equals the state dimension and the output dimension is 20, and we use $c_{int} = 1$. For MaxEnt, we compute the reward functional at each iteration by kernel density estimation with bandwidth 0.1, as stated in (Hazan et al., 2019), on the previous 10000 states stored in the buffer, and we use $c_M = 0.01$.
For RND and MaxEnt, we replace the entropy term of SAC/DAC with the intrinsic reward and the reward functional, respectively, and we use a Gaussian policy with fixed standard deviation σ = 0.1.
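A minimal numpy sketch of the RND intrinsic reward described above, with illustrative one-hidden-layer MLPs (the sizes and initialization here are assumptions for the sketch, not the exact 2-layer size-256 networks used in the experiments):

```python
import numpy as np

state_dim, feat_dim, hidden = 2, 20, 64

def init_mlp(seed):
    r = np.random.default_rng(seed)
    w1 = r.standard_normal((state_dim, hidden)) / np.sqrt(state_dim)
    w2 = r.standard_normal((hidden, feat_dim)) / np.sqrt(hidden)
    return w1, w2

def mlp(params, s):
    w1, w2 = params
    return np.maximum(s @ w1, 0.0) @ w2       # ReLU MLP features

target = init_mlp(1)    # fixed random target network f_bar
pred = init_mlp(2)      # predictor f_hat (would be trained to minimize the MSE)

def intrinsic_reward(s):
    # r_int = ||f_hat(s') - f_bar(s')||^2: large on states the predictor
    # has not yet fit, so rarely visited states receive a bigger bonus
    return np.sum((mlp(pred, s) - mlp(target, s)) ** 2, axis=-1)

s_batch = np.random.default_rng(0).standard_normal((5, state_dim))
r_int = intrinsic_reward(s_batch)
assert r_int.shape == (5,) and np.all(r_int >= 0.0)
```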






Figure 1: Pure exploration task: Continuous 4-room maze

6.1 PURE EXPLORATION COMPARISON

In order to see the exploration performance of DAC (α = 0.5) as compared to the SAC baselines, we compare state visitation on a 100 × 100 continuous 4-room maze task. The maze environment is made by modifying a continuous grid map available at https://github.com/huyaoyu/GridMap, and it is shown in Fig. 1(a). The state is the (x, y) position in the maze, the action is (dx, dy) bounded by [-1, 1], and the agent location after the action becomes (x + dx, y + dy). The agent starts from the left lower corner (0.5, 0.5) and explores the maze without any reward. Fig. 1(b) shows the mean number of new state visitations over 30 seeds, where the number of state visitations is obtained for each integer interval. As seen in Fig. 1(b), DAC visited many more states than SAC/SAC-Div, which means that the exploration performance of DAC is superior to that of the SAC baselines. In addition, Fig. 1(c) shows the corresponding state visit histogram over all seeds; the brighter a state, the more times it was visited. Note that SAC/SAC-Div rarely visit the right upper room even at 500k time steps for all seeds, whereas DAC starts visiting the right upper room much earlier.

Figure 2: Performance comparison: Fixed α case

Figure 4: Performance comparison: Adaptive α case

Fig. F.1 shows the mean number of discretized state visitations for DAC and SAC/SAC-Div. For discretization, we simply consider two components of the observations of the Mujoco tasks, which indicate the position of the agent: the x, z positions for SparseHalfCheetah, SparseHopper, and SparseWalker, and the x, y positions for SparseAnt. We discretize the position by setting the grid spacing per axis to 0.01 in the range (-10, 10).

Figure F.1: The number of discretized state visitation on sparse Mujoco tasks
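The discretized visitation count described above can be sketched as follows (a hypothetical helper, assuming the 2-D position components have already been extracted from the observation):

```python
import numpy as np

def count_visited_cells(positions, spacing=0.01, lo=-10.0, hi=10.0):
    """Count distinct grid cells visited, with the given spacing per axis in (lo, hi)."""
    pos = np.clip(positions, lo, hi - 1e-9)
    cells = np.floor((pos - lo) / spacing).astype(int)   # (N, 2) integer cell indices
    return len({tuple(c) for c in cells})

traj = np.array([[0.003, 0.004], [0.004, 0.004], [0.052, -0.013]])
assert count_visited_cells(traj) == 2   # first two points fall in the same cell
```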

Fig. F.2(a) shows the mean number of state visitations over 30 seeds on the 4-room maze task, and Fig. F.2(b) shows the corresponding state visit histogram over all seeds. As seen in Fig. F.2, DAC explores more states than RND and MaxEnt on the continuous 4-room maze task, so the exploration of DAC is more sample-efficient than that of RND/MaxEnt on the maze task.

Table D.2: State and action dimensions of Mujoco tasks and the corresponding β

Table F.1 shows the performance on sparse Mujoco tasks. Table F.2 shows the max average return for HumanoidStandup and the delayed Mujoco tasks. In Section F.3, Fig. F.3 and Table F.3 show the performance comparison to other RL algorithms on HumanoidStandup and the delayed Mujoco tasks.

Table F.2: Max average return of DAC algorithms and SAC baselines for the adaptive α setup

F.2 COMPARISON TO RND AND MAXENT

C ALGORITHM

Algorithm 1 Diversity Actor-Critic
    Initialize parameters θ, η, ψ, ψ̄, ξ, φ_i, i = 1, 2
    for each iteration do
        Sample a trajectory τ of length N by using π_θ
        Store the trajectory τ in the buffer D
        for each gradient step do
            Sample a random minibatch of size M from D
            Compute Ĵπ(θ), ĴRα(η), L̂Q(φ_i), and L̂V(ψ) from the minibatch
            θ ← θ + δ∇θ Ĵπ(θ), η ← η + δ∇η ĴRα(η)
            φ_i ← φ_i − δ∇φ_i L̂Q(φ_i), ψ ← ψ − δ∇ψ L̂V(ψ)
            Update ψ̄ by EMA from ψ
            if α-adaptation then
                Compute L̂α(ξ) from the minibatch
                ξ ← ξ − δ∇ξ L̂α(ξ)
            end if
        end for
    end for

D HYPERPARAMETER SETUP AND ENVIRONMENT DESCRIPTION

In Table D.1, we provide the detailed hyperparameter setup for DAC and the SAC baselines: SAC and SAC-Div. We also compare the performance of DAC with α-adaptation to other state-of-the-art RL algorithms. Here, we consider on-policy RL algorithms: Proximal Policy Optimization (PPO) (Schulman et al., 2017b), a stable and popular on-policy algorithm, and Actor-Critic using Kronecker-factored Trust Region (ACKTR) (Wu et al., 2017), an actor-critic method that approximates the natural gradient using Kronecker-factored curvature; and off-policy RL algorithms: Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018), which uses clipped double-Q learning to reduce overestimation, and Soft Q-Learning (SQL) (Haarnoja et al., 2017), an energy-based policy optimization method using Stein variational gradient descent. We used the implementations in OpenAI baselines (Dhariwal et al., 2017) for PPO and ACKTR, and the authors' GitHub implementations for the other algorithms. We provide the performance results as

