OPTIMISTIC EXPLORATION WITH BACKWARD BOOTSTRAPPED BONUS FOR DEEP REINFORCEMENT LEARNING

Abstract

Optimism in the face of uncertainty is a principled approach to provably efficient exploration in reinforcement learning in tabular and linear settings. However, translating this principle into practical exploration algorithms for Deep Reinforcement Learning (DRL) remains challenging. To address this problem, we propose an Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL. OEB3 is built on bootstrapped deep Q-learning, a non-parametric posterior sampling method for temporally-extended exploration. On top of this temporally-extended exploration, we construct a UCB-bonus that quantifies the uncertainty of the Q-functions. The UCB-bonus is then used to estimate an optimistic Q-value, which encourages the agent to explore scarcely visited states and actions in order to reduce uncertainty. In estimating the Q-function, we adopt an episodic backward update strategy that consistently propagates future uncertainty into the estimated Q-function. Extensive evaluations show that OEB3 outperforms several state-of-the-art exploration approaches in the MNIST maze and 49 Atari games.

1. INTRODUCTION

In Reinforcement Learning (RL) (Sutton & Barto, 2018), formalized as a Markov decision process (MDP), an agent aims to maximize long-term reward by interacting with an unknown environment. The agent takes actions based on its accumulated experience, which leads to the fundamental exploration-exploitation dilemma: the agent may make the best decision given current information, or acquire more information by exploring poorly understood states and actions. Exploring the environment may sacrifice immediate reward but can improve future performance, so the exploration strategy is crucial for finding the optimal policy. Theoretical RL offers various provably efficient exploration methods for tabular and linear MDPs built on the basic value-iteration algorithm, least-squares value iteration (LSVI). Optimism in the face of uncertainty (Auer & Ortner, 2007; Jin et al., 2018) is one principled approach: in the tabular case, optimism-based methods incorporate upper confidence bounds (UCB) into the value functions as bonuses and attain the optimal worst-case regret (Azar et al., 2017; Jaksch et al., 2010; Dann & Brunskill, 2015). Randomized value functions based on posterior sampling choose actions according to randomly sampled, statistically plausible value functions and are known to achieve near-optimal worst-case and Bayesian regrets (Osband & Van Roy, 2017; Russo, 2019). Recently, these tabular analyses have been extended to linear MDPs, where the transition and reward functions are assumed to be linear. In the linear case, optimistic LSVI (Jin et al., 2020) attains a near-optimal worst-case regret by using a carefully designed bonus and is provably efficient; randomized LSVI (Zanette et al., 2020) also attains a near-optimal worst-case regret.
Although the analyses in tabular and linear cases provide attractive recipes for efficient exploration, these principles remain challenging to turn into practical exploration algorithms for Deep Reinforcement Learning (DRL) (Mnih et al., 2015), which achieves human-level performance in large-scale tasks such as Atari games and robotic control. For example, in the linear case, the bonus in optimistic LSVI (Jin et al., 2020) and the nontrivial noise in randomized LSVI (Zanette et al., 2020) are tailored to linear models (Abbasi-Yadkori et al., 2011) and are incompatible with practically powerful function approximators such as neural networks. To address this problem, we propose the Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL. OEB3 instantiates optimistic LSVI (Jin et al., 2020) in DRL by using a general-purpose UCB-bonus to produce an optimistic Q-value and a randomized value function to perform temporally-extended exploration. We propose a UCB-bonus that represents the disagreement among bootstrapped Q-functions (Osband et al., 2016) to measure the epistemic uncertainty about the unknown optimal value function. The Q-value augmented with the UCB-bonus becomes an optimistic Q+ function that exceeds Q for scarcely visited state-action pairs and remains close to Q for frequently visited ones. The optimistic Q+ function encourages the agent to explore states and actions with high UCB-bonuses, which signify scarcely visited areas or meaningful events for completing a task. We further propose an extension of episodic backward update (Lee et al., 2019) to consistently propagate future uncertainty into the estimated action-value function within an episode. The backward update also makes the training of OEB3 highly sample-efficient. Compared to existing count-based and curiosity-driven exploration methods (Taiga et al., 2020), OEB3 has several benefits.
(1) We utilize intrinsic rewards to produce an optimistic value function while also exploiting bootstrapped Q-learning for temporally-consistent exploration, whereas existing methods do not combine these two principles. (2) The UCB-bonus measures the disagreement of Q-values, which captures the long-term uncertainty within an episode rather than the single-step uncertainty used in most bonus-based methods (Pathak et al., 2019; Burda et al., 2019b). Moreover, the UCB-bonus is computed without introducing additional modules beyond bootstrapped DQN. (3) We provide a theoretical analysis showing that OEB3 is consistent with optimistic LSVI in the linear case. (4) Extensive evaluations show that OEB3 outperforms several state-of-the-art exploration approaches in the MNIST maze and 49 Atari games.
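To make the disagreement-based bonus concrete, the following minimal sketch computes a UCB-bonus for one state as the standard deviation of K bootstrapped Q-estimates and adds it to their mean to form an optimistic Q+ value. The standard-deviation measure and the scale parameter `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ucb_bonus(q_heads):
    """Disagreement of K bootstrapped Q-estimates for one state,
    measured here (illustratively) as the per-action standard deviation."""
    return np.std(q_heads, axis=0)

def optimistic_q(q_heads, beta=1.0):
    """Optimistic Q+ value: mean Q-estimate plus a scaled UCB-bonus.
    beta (a hypothetical hyperparameter) trades exploration off
    against exploitation."""
    return np.mean(q_heads, axis=0) + beta * ucb_bonus(q_heads)

# K = 4 heads, 3 actions: the heads agree on action 0 but strongly
# disagree on action 2, so action 2 receives a large bonus.
q_heads = np.array([
    [1.0, 0.5,  0.0],
    [1.0, 0.6,  2.0],
    [1.0, 0.4, -2.0],
    [1.0, 0.5,  0.0],
])
q_plus = optimistic_q(q_heads, beta=1.0)
greedy_action = int(np.argmax(q_plus))
```

A greedy policy over Q+ then prefers the uncertain action 2 even though its mean estimate is lower than that of action 0, which is exactly the optimism-driven behavior described above.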

2. BACKGROUND

In this section, we introduce bootstrapped DQN (Osband et al., 2016), which OEB3 uses for temporally-extended exploration. We then introduce optimistic LSVI (Jin et al., 2020), which OEB3 instantiates in the DRL setting.

2.1. BOOTSTRAPPED DQN

We consider an episodic MDP represented as $(\mathcal{S}, \mathcal{A}, T, P, r)$, where $T \in \mathbb{Z}_+$ is the episode length, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is the reward function, and $P$ is the unknown dynamics. At each timestep, the agent observes the current state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to the next state $s_{t+1}$. The action-value function
$$Q^{\pi}(s_t, a_t) := \mathbb{E}_{\pi}\Big[\sum_{i=t}^{T-1} \gamma^{i-t} r_i\Big]$$
represents the expected cumulative reward obtained by taking action $a_t$ in state $s_t$ and thereafter following policy $\pi(a_t \mid s_t)$ until the end of the episode, where $\gamma \in [0, 1)$ is the discount factor. The optimal value function is $Q^* = \max_{\pi} Q^{\pi}$, and the optimal action is $a^* = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$.

Deep Q-Network (DQN) uses a deep neural network with parameters $\theta$ to approximate the Q-function. The loss function takes the form
$$L(\theta) = \mathbb{E}\big[\big(y_t - Q(s_t, a_t; \theta)\big)^2 \,\big|\, (s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}\big],$$
where $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$ is the target value and $\theta^-$ is the parameter of the target network. The agent accumulates experiences $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer $\mathcal{D}$ and samples mini-batches during training.

Bootstrapped DQN (Osband et al., 2016; 2018) is a non-parametric posterior sampling method that maintains $K$ estimates of the Q-value to represent the posterior distribution of the randomized value function. Bootstrapped DQN uses a multi-head network consisting of a shared convolutional network and $K$ heads, each defining a $Q^k$-function. Bootstrapped DQN diversifies the heads through different random initializations and individual target networks. The loss for training $Q^k$ is
$$L(\theta^k) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q^k(s_{t+1}, a'; \theta^{k-}) - Q^k(s_t, a_t; \theta^k)\big)^2 \,\Big|\, (s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}\Big]. \quad (1)$$
The $k$-th head $Q^k(s, a; \theta^k)$ is trained against its own target network $Q^k(s, a; \theta^{k-})$.
If the $k$-th head is sampled at the start of an episode, the agent follows $Q^k$ to choose actions for the whole episode, which provides temporally-consistent exploration for DRL.
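The per-head update of Eq. (1) and the episode-level head sampling can be sketched in a tabular toy setting, assuming $K$ independently initialized Q-tables standing in for the network heads; the constants (`K`, learning rate, table sizes) are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_states, n_actions = 4, 5, 2
gamma, lr = 0.99, 0.5

# K bootstrapped Q-estimates with different random initializations,
# each paired with its own target network (here: tabular copies).
q = [rng.normal(scale=0.01, size=(n_states, n_actions)) for _ in range(K)]
q_target = [qk.copy() for qk in q]

def update_head(k, s, a, r, s_next, done):
    """One gradient-free analogue of the per-head loss in Eq. (1):
    move Q^k(s, a) toward r + gamma * max_a' Q^{k-}(s', a'),
    using head k's own target table."""
    target = r if done else r + gamma * q_target[k][s_next].max()
    q[k][s, a] += lr * (target - q[k][s, a])

# Temporally-consistent exploration: sample one head at the start of an
# episode and act greedily w.r.t. that head for the whole episode.
k = int(rng.integers(K))
s = 0
a = int(np.argmax(q[k][s]))
update_head(k, s, a, r=1.0, s_next=1, done=False)
```

Only the sampled head's table is updated toward its own target; in the full algorithm each head would instead take a gradient step on Eq. (1) over replayed mini-batches, and the target tables would be refreshed periodically.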

