MQES: MAX-Q ENTROPY SEARCH FOR EFFICIENT EXPLORATION IN CONTINUOUS REINFORCEMENT LEARNING

Abstract

The principle of optimism in the face of (aleatoric and epistemic) uncertainty has been used to design efficient exploration strategies for Reinforcement Learning (RL). In contrast to most prior work, which targets discrete action spaces, we propose a general information-theoretic exploration principle called Max-Q Entropy Search (MQES) for continuous RL algorithms. MQES formulates the exploration policy to maximize the information about the globally optimal distribution of the Q function, which allows it to explore optimistically while avoiding over-exploration, by accounting for the epistemic and aleatoric uncertainty respectively. To make MQES practically tractable, we first incorporate distributional and ensemble Q function approximations into MQES, which capture the aleatoric and epistemic uncertainty accordingly. We then introduce a constraint to stabilize training, and solve the constrained MQES problem to derive the exploration policy in closed form. Empirical evaluations show that MQES outperforms state-of-the-art algorithms on MuJoCo environments.

1. INTRODUCTION

In Reinforcement Learning (RL), one of the fundamental problems is the exploration-exploitation dilemma: should the agent explore states about which its knowledge is imperfect in the hope of improving future reward, or instead maximize the immediate reward at states it already understands well? The main obstacle to designing efficient exploration strategies is deciding whether unexplored states lead to high cumulative reward or not. Popular exploration strategies, such as ε-greedy (Sutton & Barto, 1998) and sampling from a stochastic policy (Haarnoja et al., 2018), lead to undirected exploration through additional random perturbations. Recently, the uncertainty of the system has been introduced to guide exploration (Kirschner & Krause, 2018; Mavrin et al., 2019; Clements et al., 2019; Ciosek et al., 2019). As Moerland et al. (2017) point out, two sources of uncertainty exist in an RL system: epistemic and aleatoric uncertainty. Epistemic uncertainty, also called parametric uncertainty, is the ambiguity of the model arising from imperfect knowledge of the environment, and can be reduced with more data. Aleatoric uncertainty is an intrinsic variation associated with the environment, caused by the randomness of the environment itself, and is not affected by the model. In an RL system, seldom-visited states have relatively large epistemic uncertainty, so exploration methods should encourage exploration where epistemic uncertainty is large. Moreover, heteroscedastic aleatoric uncertainty means that different states may have different degrees of randomness. If we do not distinguish these two uncertainties and model them separately, we may explore states that are visited frequently but highly random, i.e., states with low epistemic but high aleatoric uncertainty, which is undesirable.
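The distinction above can be made concrete with a toy sketch. Assuming an ensemble of K distributional critics, each predicting N quantiles of the return for a single (state, action) pair (all values below are hypothetical), the aleatoric uncertainty is the spread *within* each member's predicted distribution, while the epistemic uncertainty is the disagreement *between* members:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K ensemble members, each predicting N return quantiles
# for one (state, action) pair. Values are hypothetical.
K, N = 5, 32
quantiles = rng.normal(loc=1.0, scale=0.5, size=(K, N))

# Aleatoric uncertainty: intrinsic return spread, visible within each
# member's predicted distribution (averaged over members).
aleatoric = np.mean(np.var(quantiles, axis=1))

# Epistemic uncertainty: disagreement between members about the mean
# return; it shrinks as more data is collected.
member_means = quantiles.mean(axis=1)
epistemic = np.var(member_means)

print(f"aleatoric={aleatoric:.3f}, epistemic={epistemic:.3f}")
```

A frequently visited but noisy state would show high `aleatoric` and low `epistemic`; only the latter should trigger exploration.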
By introducing uncertainty, exploration objectives such as Thompson Sampling (TS) (Thompson, 1933; Osband et al., 2016) and the Upper Confidence Bound (UCB) (Auer, 2002; Mavrin et al., 2019; Chen et al., 2017) have been used to guide exploration in RL. However, since the aleatoric uncertainty in RL systems is heteroscedastic, i.e., it depends on the state and action and can differ across them, these methods are not efficient. Hence, Nikolov et al. (2019) propose a novel exploration objective called Information-Directed Sampling (IDS) that accounts for epistemic uncertainty and heteroscedastic aleatoric uncertainty. However, these methods (Nikolov et al., 2019; Mavrin et al., 2019; Chen et al., 2017; Osband et al., 2016) can only be applied in environments with discrete action spaces. In this paper, we propose a general information-theoretic principle called Max-Q Entropy Search (MQES) for off-policy continuous RL algorithms. Further, as an application of MQES, we combine distributional RL with the soft actor-critic method, where the epistemic and aleatoric uncertainty are formulated accordingly. We then incorporate MQES into the Distributional Soft Actor-Critic (DSAC) (Ma et al., 2020) method, and show how MQES uses both uncertainties to explore. Finally, our results on MuJoCo environments show that our method substantially outperforms alternative state-of-the-art algorithms.
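To make the discrete-action limitation concrete, here is a minimal sketch of the TS and UCB objectives over a small discrete action set, using ensemble Q-estimates as the epistemic uncertainty proxy (the Q-values, ensemble size, and bonus weight are all toy assumptions, not taken from any cited method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ensemble Q-estimates: 5 ensemble members x 4 discrete actions.
q_ensemble = rng.normal(loc=[0.0, 0.5, 0.4, 0.2], scale=0.3, size=(5, 4))

# Thompson Sampling: draw one ensemble member, act greedily w.r.t. it.
member = rng.integers(len(q_ensemble))
a_ts = int(np.argmax(q_ensemble[member]))

# UCB: act greedily w.r.t. the mean plus a bonus proportional to the
# ensemble standard deviation (epistemic uncertainty proxy).
beta = 1.0  # toy bonus weight
ucb = q_ensemble.mean(axis=0) + beta * q_ensemble.std(axis=0)
a_ucb = int(np.argmax(ucb))
```

Both selection rules rely on an `argmax` over an enumerable action set, which is exactly what breaks down in continuous action spaces and motivates a closed-form exploration policy instead.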

2. RELATED WORK

Efficient exploration can improve the efficiency and performance of RL algorithms. With the increasing emphasis on exploration efficiency, various exploration methods have been developed. One class of methods uses intrinsic motivation to stimulate the agent to explore from different perspectives, such as count-based novelty (Martin et al., 2017; Ostrovski et al., 2017; Bellemare et al., 2016; Tang et al., 2017; Fox et al., 2018), prediction error (Pathak et al., 2017), reachability (Savinov et al., 2019) and information gain on environment dynamics (Houthooft et al., 2016). Other recently proposed methods in deep RL, originating from tracking uncertainty, explore efficiently under the principle of optimism in the face of uncertainty (OFU), such as Thompson Sampling (Thompson, 1933; Osband et al., 2016), IDS (Nikolov et al., 2019; Clements et al., 2019) and other customized methods (Moerland et al., 2017; Pathak et al., 2019).

Methods for tracking uncertainty. Bootstrapped DQN (Osband et al., 2016) combines Thompson sampling with value-based RL algorithms. It is similar to PSRL (Strens, 2000; Osband et al., 2013), and leverages the uncertainty of the value estimates for deep exploration. Bootstrapped DQN has become a common baseline for much recent work, and a widely used approach for capturing epistemic uncertainty (Kirschner & Krause, 2018; Ciosek et al., 2019). However, it takes only epistemic uncertainty into account. Distributional RL methods approximate the return distribution directly, such as Categorical DQN (C51) (Bellemare et al., 2017), QR-DQN (Dabney et al., 2018b) and IQN (Dabney et al., 2018a). The return distribution can be used to approximate aleatoric uncertainty, but these methods do not exploit it for exploration.

Exploration with two types of uncertainty.
Traditional OFU methods either focus only on epistemic uncertainty, or treat the two kinds of uncertainty as a whole, which can easily lead a naive solution to favor actions with higher variance. To address this, Mavrin et al. (2019) study how to exploit the distributions learned by distributional RL methods for efficient exploration under both kinds of uncertainty, proposing Decaying Left Truncated Variance (DLTV). Nikolov et al. (2019) and Clements et al. (2019) propose to use Information-Directed Sampling (Kirschner & Krause, 2018) for efficient exploration in RL (IDS for RL): they estimate both kinds of uncertainty and use IDS to select actions. We follow the uncertainty estimation practice of Clements et al. (2019), as shown in Sec. 4.2.1. IDS integrates both uncertainties and has made progress on exploration, but it is limited to discrete action spaces. In this paper, we focus on how best to exploit both uncertainties for efficient exploration in continuous action spaces.

Optimistic Actor Critic. Most closely related to our work is OAC (Ciosek et al., 2019), which uses epistemic uncertainty to build an upper bound Q_UB on the Q estimate. OAC is based on Soft Actor-Critic (SAC) (Haarnoja et al., 2018), additionally proposing an exploration bonus to facilitate exploration. Despite the advantages OAC achieves over SAC, it does not consider the potential impact of aleatoric uncertainty, which may mislead exploration.
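The upper bound Q_UB can be sketched as follows. This is our reading of the OAC-style construction, not a reproduction of its implementation: the epistemic mean and spread are taken from two bootstrapped critics, and the bonus weight `beta_ub` is a tunable hyperparameter whose value here is arbitrary:

```python
import numpy as np


def q_upper_bound(q1, q2, beta_ub=1.0):
    """Optimistic Q estimate in the style of OAC (Ciosek et al., 2019).

    q1, q2: estimates from two bootstrapped critics for the same
    (state, action) pair. The mean is the point estimate; half the
    absolute gap serves as an epistemic uncertainty proxy. beta_ub
    is a tunable optimism weight (value here is illustrative).
    """
    mu = 0.5 * (q1 + q2)
    sigma = 0.5 * np.abs(q1 - q2)
    return mu + beta_ub * sigma
```

Because `sigma` captures only critic disagreement, an upper bound of this form is blind to aleatoric spread in the return itself, which is the gap that motivates distinguishing both uncertainties in MQES.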

