DIVERSITY ACTOR-CRITIC: SAMPLE-AWARE ENTROPY REGULARIZATION FOR SAMPLE-EFFICIENT EXPLORATION

Anonymous

Abstract

Policy entropy regularization is commonly used for better exploration in deep reinforcement learning (RL). However, policy entropy regularization is sample-inefficient in off-policy learning since it does not take the distribution of previous samples stored in the replay buffer into account. In order to take advantage of the previous sample distribution from the replay buffer for sample-efficient exploration, we propose sample-aware entropy regularization, which maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer. We formulate the problem of sample-aware entropy-regularized policy iteration, prove its convergence, and provide a practical algorithm named diversity actor-critic (DAC), which is a generalization of soft actor-critic (SAC). Numerical results show that DAC significantly outperforms SAC baselines and other state-of-the-art RL algorithms.

1. INTRODUCTION

Reinforcement learning (RL) aims to maximize the expectation of the discounted reward sum in Markov decision process (MDP) environments (Sutton & Barto, 1998). When the given task is complex, i.e., the environment has a high-dimensional action space or sparse rewards, exploring state-action pairs well is important for high performance (Agre & Rosenschein, 1996). For better exploration, recent RL considers various methods: maximizing the policy entropy so that actions are taken more uniformly (Ziebart et al., 2008; Fox et al., 2015; Haarnoja et al., 2017), maximizing diversity gain that yields intrinsic rewards for visiting rare states by counting state visitations (Strehl & Littman, 2008; Lopes et al., 2012), maximizing information gain (Houthooft et al., 2016; Hong et al., 2018), maximizing model prediction error (Achiam & Sastry, 2017; Pathak et al., 2017), and so on. In particular, based on policy iteration for soft Q-learning, Haarnoja et al. (2018a) considered an off-policy actor-critic framework for maximum entropy RL and proposed the soft actor-critic (SAC) algorithm, which has competitive performance on challenging continuous control tasks.

In this paper, we reconsider the problem of policy entropy regularization in off-policy learning and propose a generalized approach to it. In off-policy learning, we store and reuse old samples to update the current policy (Mnih et al., 2015), and it is preferable that the old sample distribution in the replay buffer be close to uniform for better performance. However, simple policy entropy regularization tries to maximize the entropy of the current policy irrespective of the distribution of previous samples. Since the uniform distribution has maximum entropy, the current policy will choose previously less-sampled actions and more-sampled actions with the same probability; hence, simple policy entropy regularization is sample-unaware and sample-inefficient.
In order to overcome this drawback, we propose sample-aware entropy regularization, which maximizes the entropy of the weighted sum of the current policy action distribution and the sample action distribution from the replay buffer. We will show that the proposed sample-aware entropy regularization reduces to maximizing the sum of the policy entropy and the α-skewed Jensen-Shannon divergence (Nielsen, 2019) between the policy distribution and the buffer sample action distribution, and hence it generalizes SAC. We will also show that properly exploiting the sample action distribution in addition to the policy entropy over the learning phases yields far better performance.
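The decomposition mentioned above can be checked numerically on a toy discrete example. The sketch below is illustrative, not the paper's implementation: it assumes the α-skewed Jensen-Shannon divergence is defined with the mixture m = απ + (1−α)q as JS_α(π, q) = α KL(π‖m) + (1−α) KL(q‖m) (conventions for the skew parameter vary across papers), and the distributions `pi`, `q`, and the weight `alpha` are made-up toy values.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def skewed_js(p, q, alpha):
    """alpha-skewed Jensen-Shannon divergence with mixture m = alpha*p + (1-alpha)*q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = alpha * p + (1 - alpha) * q
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return alpha * kl(p, m) + (1 - alpha) * kl(q, m)

# toy policy action distribution pi and buffer sample-action distribution q
pi = np.array([0.7, 0.2, 0.1])
q  = np.array([0.1, 0.3, 0.6])
alpha = 0.5

# entropy of the weighted sum of the two distributions ...
lhs = entropy(alpha * pi + (1 - alpha) * q)
# ... equals the weighted entropies plus the skewed JS divergence
rhs = alpha * entropy(pi) + (1 - alpha) * entropy(q) + skewed_js(pi, q, alpha)
print(np.isclose(lhs, rhs))  # True: the mixture entropy decomposes as claimed
```

This identity is what lets the sample-aware entropy objective be interpreted as the policy entropy term plus a divergence term pushing the policy away from the buffer's action distribution.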

2. RELATED WORKS

Entropy regularization: Entropy regularization maximizes the sum of the expected return and the policy action entropy. It encourages the agent to visit the action space uniformly for each given state, and the regularized policy is robust to modeling error (Ziebart, 2010). Entropy regularization has been considered in various domains for better optimization: inverse reinforcement learning (Ziebart et al., 2008), stochastic optimal control problems (Todorov, 2008; Toussaint, 2009; Rawlik et al., 2013), and off-policy reinforcement learning (Fox et al., 2015; Haarnoja et al., 2017). Lee et al. (2019) show that Tsallis entropy regularization, which generalizes the usual Shannon-entropy regularization, is helpful. Nachum et al. (2017a) show that there exists a connection between value-based and policy-based RL under entropy regularization. O'Donoghue et al. (2016) proposed an algorithm combining the two, and they are proven to be equivalent (Schulman et al., 2017a). Maximizing the entropy of the state mixture distribution is better for pure exploration than a simple random policy (Hazan et al., 2019).

Diversity gain: Diversity gain is used to provide guidance for exploration to the agent. To achieve diversity gain, many intrinsically-motivated approaches and intrinsic reward design methods have been considered, e.g., intrinsic reward based on curiosity (Chentanez et al., 2005; Baldassarre & Mirolli, 2013), model prediction error (Achiam & Sastry, 2017; Pathak et al., 2017; Burda et al., 2018), divergence/information gain (Houthooft et al., 2016; Hong et al., 2018), counting (Strehl & Littman, 2008; Lopes et al., 2012; Tang et al., 2017; Martin et al., 2017), and a unification of these (Bellemare et al., 2016). For self-imitation learning, Gangwani et al. (2018) considered Stein variational gradient descent with the Jensen-Shannon kernel.
Off-policy learning: Off-policy learning can reuse any samples generated from behaviour policies for the policy update (Sutton & Barto, 1998; Degris et al., 2012), so it is sample-efficient compared to on-policy learning. In order to reuse old samples, a replay buffer that stores trajectories generated by previous policies is used for Q-learning (Mnih et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a). To enhance both stability and sample efficiency, several methods have been considered, e.g., combining on-policy and off-policy learning (Wang et al., 2016; Gu et al., 2016; 2017) and generalizing from on-policy to off-policy learning (Nachum et al., 2017b; Han & Sung, 2019). A key assumption for guaranteeing the convergence of Q-learning is that each state-action pair is visited infinitely often (Watkins & Dayan, 1992). If the policy does not visit diverse state-action pairs many times, learning can converge to a local optimum. Therefore, exploration that visits different state-action pairs is important for RL, and the original policy entropy regularization encourages such exploration (Ahmed et al., 2019). However, we found that simple policy entropy regularization can be sample-inefficient in off-policy RL, so we propose a new entropy regularization method that significantly enhances the sample efficiency of exploration by considering the previous sample distribution in the buffer.
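The replay buffer described above is just a fixed-capacity store of transitions with uniform random sampling. The following is a minimal sketch (the class and method names are our own, not from any specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        # deque with maxlen discards the oldest transition when full
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling over stored transitions enables off-policy reuse
        return random.sample(self.storage, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.add(t, 0, 1.0, t + 1, False)
batch = buf.sample(32)
print(len(batch))  # 32
```

Note that uniform sampling from such a buffer is exactly why the buffer's empirical action distribution, and not just the current policy's, matters for exploration.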

3. BACKGROUND

In this section, we briefly introduce the basic setup and the soft actor-critic (SAC) algorithm.

3.1. SETUP

We assume a basic RL setup composed of an environment and an agent. The environment follows an infinite-horizon Markov decision process $(\mathcal{S}, \mathcal{A}, P, \gamma, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability, $\gamma$ is the discount factor, and $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function. In this paper, we consider a continuous state-action space. The agent has a policy distribution $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, \infty)$ which selects an action $a_t$ for a given state $s_t$ at each time step $t$, and the agent interacts with the environment and receives the reward $r_t := r(s_t, a_t)$ from the environment. Standard RL aims to maximize the discounted return $\mathbb{E}_{s_0 \sim p_0,\, \tau_0 \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\tau_t = (s_t, a_t, s_{t+1}, a_{t+1}, \cdots)$ is an episode trajectory.
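For a finite reward sequence, the discounted sum above can be computed with a simple backward recursion (Horner's rule); the function below is a toy illustration, not part of the algorithm:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t via the backward recursion g_t = r_t + gamma * g_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```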

3.2. SOFT ACTOR-CRITIC

Soft actor-critic (SAC) (Haarnoja et al., 2018a) includes a policy entropy regularization term in the objective function for better exploration by visiting the action space uniformly for each given state.
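Concretely, SAC-style entropy regularization replaces the standard Bellman target $r + \gamma\, \mathbb{E}[Q(s', a')]$ with a soft target that adds an entropy bonus, $r + \gamma\, \mathbb{E}_{a' \sim \pi}[Q(s', a') - \alpha \log \pi(a' \mid s')]$. The sketch below illustrates this with a discrete-action toy expectation (SAC itself is for continuous actions and samples the expectation); the function name and the toy numbers are our own:

```python
import numpy as np

def soft_bellman_target(reward, gamma, next_q_values, next_log_probs, next_probs, alpha):
    """Toy soft Bellman target: r + gamma * E_{a'~pi}[Q(s',a') - alpha * log pi(a'|s')]."""
    v_next = np.sum(next_probs * (next_q_values - alpha * next_log_probs))
    return reward + gamma * v_next

# two-action toy example: uniform policy over next actions
probs = np.array([0.5, 0.5])
q_vals = np.array([1.0, 2.0])
target = soft_bellman_target(0.0, 0.99, q_vals, np.log(probs), probs, alpha=0.2)
# the entropy bonus makes the target larger than the plain Bellman target
plain = 0.99 * np.sum(probs * q_vals)
print(target > plain)  # True
```

Setting `alpha = 0` recovers the standard (unregularized) Bellman target, which is the sense in which DAC's sample-aware objective further generalizes this entropy term.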

