REGRET BOUNDS AND REINFORCEMENT LEARNING EXPLORATION OF EXP-BASED ALGORITHMS

Anonymous

Abstract

EXP-based algorithms are often used for exploration in multi-armed bandits. We revisit the EXP3.P algorithm and establish both lower and upper bounds on regret in the Gaussian multi-armed bandit setting, as well as for a more general reward distribution. Unlike classical regret analyses, ours does not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games, where it improves exploration compared to the state-of-the-art.

1. INTRODUCTION

The multi-armed bandit (MAB) problem is to maximize the cumulative reward of a player throughout a bandit game by choosing different arms at each time step. Equivalently, the player minimizes the regret, defined as the difference between the best reward achievable and the reward actually gained. Formally, given a time horizon $T$, at each step $t \le T$ the player chooses one arm $a_t$ among $K$ arms, receives $r_{t,a_t}$ from the reward vector $r_t = (r_{t,1}, r_{t,2}, \ldots, r_{t,K})$, and aims to maximize the total reward $\sum_{t=1}^{T} r_{t,a_t}$, or equivalently to minimize the regret.

EXP-type MAB algorithms are computationally efficient and come with abundant theoretical analyses. In EXP3.P, each arm has a trust coefficient (weight). The player samples each arm with probability equal to the sum of its normalized weight and a bias term, receives the reward of the sampled arm, and exponentially updates the weights based on the corresponding reward estimates. It achieves regret of order $O(\sqrt{T})$ with high probability. In EXP4, there is an arbitrary number of experts, each with a sampling rule over actions and a weight. The player samples according to the weighted average of the experts' sampling rules and updates the weights accordingly.

Contextual bandit is a variant of MAB that adds a context (state) space $S$. At time step $t$, the player observes context $s_t \in S$, with $s_{1:T} = (s_1, s_2, \ldots, s_T)$ being independent. Rewards $r_t$ follow $F(\mu(s_t))$, where $F$ is any distribution and $\mu(s_t)$ is a mean vector that depends on state $s_t$. Reinforcement learning (RL) generalizes contextual bandits: state and reward transitions follow a Markov Decision Process (MDP) with transition kernel $P(s_{t+1}, r_t \mid a_t, s_t)$.

A key challenge in RL is the trade-off between exploration and exploitation. Exploration encourages the player to try new arms in MAB, or new actions in RL, in order to understand the game better. It helps to plan for the future, at the cost of potentially lowering the current reward.
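The EXP3.P sampling-and-update loop described above can be sketched as follows. This is a minimal sketch assuming rewards in $[0, 1]$ as in the classical analysis; the function name `exp3p` and the parameter defaults are illustrative, not the paper's implementation.

```python
import numpy as np

def exp3p(pull, K, T, alpha=2.0, gamma=0.1, seed=0):
    """Minimal EXP3.P sketch (Auer et al., 2002b).

    pull(a) returns the observed reward of arm a; rewards are
    assumed to lie in [0, 1] as in the classical analysis.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(K)                                # trust coefficients (weights)
    total = 0.0
    for _ in range(T):
        # normalized weights plus a uniform-exploration bias term
        p = (1 - gamma) * w / w.sum() + gamma / K
        a = rng.choice(K, p=p)
        r = pull(a)
        total += r
        # optimistic importance-weighted reward estimates
        xhat = alpha / (p * np.sqrt(K * T))
        xhat[a] += r / p[a]
        w = w * np.exp(gamma / (3.0 * K) * xhat)
        w = w / w.max()                           # rescale for numerical stability
    return total
```

Here `gamma` controls the uniform-exploration mixture and `alpha` the confidence bias added to every estimate; the bias keeps the estimates optimistic, which is what yields the high-probability regret guarantee.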
Exploitation aims to use the currently known states and arms to maximize the current reward, but it can prevent the player from gaining the information needed to increase future reward. To maximize the cumulative reward, the player needs to learn the game through exploration while securing current reward through exploitation. How to incentivize exploration has thus been a main focus in RL. Since RL builds on MAB, it is natural to extend MAB techniques to RL, and UCB is such a success: UCB (Auer et al. (2002a)) motivates count-based exploration (Strehl and Littman, 2008) in RL and the subsequent Pseudo-Count exploration (Bellemare et al., 2016). New deep RL exploration algorithms have also been proposed recently. Using deep neural networks to keep track of the Q-values by means of Q-networks in RL is called DQN (Mnih et al. (2013)), and this combination of deep learning and RL has shown great success.

In order to address weak points of these various exploration algorithms in the RL context, the notion of experts is natural, and thus EXP-type MAB algorithms are appropriate. The allowance of arbitrary experts provides exploration for harder contextual bandits and hence exploration possibilities for RL. We develop an EXP4 exploration algorithm for RL that relies on several general experts. To our knowledge, this is the first RL algorithm that uses several exploration experts to enable global exploration. In the computational study we focus on DQN and on two agents: RND and ε-greedy DQN. We implement the RL EXP4 algorithm on the hard-to-explore RL game Montezuma's Revenge and compare it with the benchmark algorithm RND (Burda et al. (2018)). The numerical results show that the algorithm explores more than RND and attains global exploration by not getting stuck in the local maxima of RND. Its total reward also increases with training. Overall, our algorithm improves exploration and exploitation on the benchmark game and demonstrates a learning process in RL.

Reward in RL is in many cases unbounded, which relates to unbounded MAB rewards. There are three major versions of MAB: adversarial, stochastic, and the herein introduced Gaussian. For adversarial MAB, the rewards $r_t$ of the $K$ arms can be chosen arbitrarily by adversaries at step $t$.
For stochastic MAB, the rewards at different steps are assumed to be i.i.d. and the rewards across arms are independent; it is also assumed that $0 \le r_{t,i} \le 1$ for any arm $i$ and step $t$. For Gaussian MAB, the rewards $r_t$ follow a multivariate normal distribution $N(\mu, \Sigma)$, with $\mu$ the mean vector and $\Sigma$ the covariance matrix of the $K$ arms. Here the rewards are neither bounded nor independent across arms. For this reason the introduced Gaussian MAB reflects the RL setting and is the subject of our MAB analyses of EXP3.P.

EXP-type algorithms (Auer et al. (2002b)) are optimal in the two classical MABs: Auer et al. (2002b) show lower and upper bounds on regret of order $O(\sqrt{T})$ for adversarial MAB and of order $O(\log T)$ for stochastic MAB. All of these regret bounds for EXP-type algorithms rest on the bounded-reward assumption, which does not hold for Gaussian MAB. Therefore, the regret bounds for Gaussian MAB with unbounded rewards studied herein differ significantly from prior work. We show both lower and upper bounds on the regret of Gaussian MAB under certain assumptions, and some analyses even hold for more general reward distributions. The upper bounds adapt ideas from the analysis of the EXP3.P algorithm in Auer et al. (2002b) for bounded MAB to our unbounded MAB, while the lower bounds rely on a new construction of instances. Precisely, we derive lower bounds of order $\Omega(T)$ for certain fixed $T$ and upper bounds of order $O^*(\sqrt{T})$ for $T$ large enough. The question of bounds for any value of $T$ remains open.

The main contributions of this work are as follows. On the analytical side, we introduce Gaussian MAB with the unique aspect and challenge of unbounded rewards. We provide the very first regret lower bound in this setting by constructing a novel family of Gaussian bandits, and we analyze the EXP3.P algorithm for Gaussian MAB. The unbounded rewards pose a non-trivial challenge in the analyses.
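To make the setting concrete, a Gaussian MAB instance can be simulated as below. This is a minimal sketch with an illustrative mean vector and covariance matrix; the uniform (non-learning) player and the hindsight-best-arm regret are only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 10_000
mu = np.array([0.0, 0.5, 1.0])                 # illustrative mean vector
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])            # arms are correlated

# r_t ~ N(mu, Sigma): rewards are unbounded and dependent across arms
rewards = rng.multivariate_normal(mu, Sigma, size=T)

arms = rng.integers(0, K, size=T)              # a uniform (non-learning) player
gained = rewards[np.arange(T), arms].sum()
best = rewards.sum(axis=0).max()               # best single arm in hindsight
regret = best - gained
```

Note that, unlike in the stochastic model, no clipping to $[0, 1]$ is possible here without changing the problem, which is exactly why the bounded-reward analyses do not apply.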
We also provide the very first extension of EXP4 to RL exploration and show its superior performance on two hard-to-explore RL games. A literature review is provided in Section 2. In Section 3 we establish the upper and lower regret bounds of the EXP3.P algorithm for unbounded MAB. Section 4 discusses the EXP4 algorithm for RL exploration. Finally, in Section 5, we present numerical results for the proposed algorithm.
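Since the RL algorithm of Section 4 builds on EXP4, it may help to see EXP4's single-step rule in code. This is a minimal sketch under the bounded-reward MAB setting; the name `exp4_step`, its signature, and the test experts are illustrative, not the RL extension itself.

```python
import numpy as np

def exp4_step(w, advice, pull, gamma=0.05, rng=None):
    """One EXP4 step: experts give sampling rules (probability vectors
    over K arms); the player mixes them by trust weights, samples an
    arm, and exponentially reweights the experts.

    w: (N,) expert weights; advice: (N, K) rows summing to 1.
    """
    if rng is None:
        rng = np.random.default_rng()
    N, K = advice.shape
    p = (1 - gamma) * (w / w.sum()) @ advice + gamma / K
    a = rng.choice(K, p=p)
    r = pull(a)
    xhat = np.zeros(K)
    xhat[a] = r / p[a]                        # importance-weighted estimate
    w = w * np.exp(gamma / K * (advice @ xhat))
    return w / w.sum(), a, r                  # normalize for stability
```

Each expert is credited with the estimated reward its own sampling rule would have earned (`advice @ xhat`), so trust migrates toward the better-informed experts; in the RL extension the experts are exploration agents rather than fixed sampling rules.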

2. LITERATURE REVIEW

The importance of exploration in RL is well understood. Count-based exploration in RL relies on UCB. Strehl and Littman (2008) develop the Bellman value iteration $V(s) = \max_a \left\{ R(s,a) + \gamma E[V(s')] + \beta N(s,a)^{-1/2} \right\}$, where $N(s,a)$ is the number of visits to $(s,a)$ for state $s$ and action $a$. The bonus $N(s,a)^{-1/2}$ is positively correlated with the curiosity about $(s,a)$ and encourages exploration. This method is limited to tabular model-based MDPs with small state spaces, while Bellemare et al. (2016) introduce Pseudo-Count exploration for non-tabular MDPs with density models. In conjunction with DQN, ε-greedy (Mnih et al. (2015)) is a simple exploration technique. Besides ε-greedy, intrinsic-model exploration computes intrinsic rewards from the accuracy of a model trained on past experiences; intrinsic rewards directly measure and incentivize exploration.
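The count-based update of Strehl and Littman (2008) can be sketched in the tabular setting as follows. This is a minimal sketch; the toy MDP shapes and the name `bonus_value_iteration` are illustrative.

```python
import numpy as np

def bonus_value_iteration(R, P, N, gamma=0.9, beta=1.0, iters=500):
    """Tabular value iteration with a count-based exploration bonus:
        V(s) = max_a [ R(s,a) + beta * N(s,a)^(-1/2) + gamma * E[V(s')] ].
    R: (S, A) mean rewards, P: (S, A, S) transition kernel,
    N: (S, A) visit counts (assumed >= 1)."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + beta / np.sqrt(N) + gamma * (P @ V)   # bonus shrinks with visits
        V = Q.max(axis=1)
    return V
```

Rarely visited state-action pairs carry a larger bonus $\beta N(s,a)^{-1/2}$, so the planner is drawn toward them; as counts grow, the bonus vanishes and the iteration reduces to standard value iteration.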




