REGRET BOUNDS AND REINFORCEMENT LEARNING EXPLORATION OF EXP-BASED ALGORITHMS

Anonymous

Abstract

EXP-based algorithms are often used for exploration in multi-armed bandits. We revisit the EXP3.P algorithm and establish both lower and upper bounds on its regret in the Gaussian multi-armed bandit setting, as well as under a more general distributional assumption. In contrast to classical regret analyses, our analyses do not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games and shows improved exploration compared to the state of the art.

1. INTRODUCTION

The multi-armed bandit (MAB) problem is to maximize the cumulative reward of a player over a bandit game by choosing among different arms at each time step. Equivalently, the player minimizes the regret, defined as the difference between the best reward that could have been achieved and the reward actually gained. Formally, given a time horizon T, at each time step t ≤ T the player chooses one arm a_t among K arms, receives r_{t,a_t} from the reward vector r_t = (r_{t,1}, r_{t,2}, ..., r_{t,K}), and aims to maximize the total reward Σ_{t=1}^T r_{t,a_t} or, equivalently, minimize the regret.

EXP-type MAB algorithms are computationally efficient and come with abundant theoretical analyses. In EXP3.P, each arm has a trust coefficient (weight). The player samples each arm with probability equal to its normalized weight plus a bias term, receives the reward of the sampled arm, and exponentially updates the weights based on the corresponding reward estimates. EXP3.P achieves regret of order O(√T) with high probability. In EXP4, there can be any number of experts, each with a sampling rule over actions and a weight. The player samples according to the weighted average of the experts' sampling rules and updates the weights accordingly.

Contextual bandit is a variant of MAB that adds a context (state) space S. At time step t, the player observes context s_t ∈ S, with s_{1:T} = (s_1, s_2, ..., s_T) being independent. Rewards r_t follow F(μ(s_t)), where F is an arbitrary distribution and μ(s_t) is a mean vector that depends on the state s_t. Reinforcement learning (RL) generalizes the contextual bandit: state and reward transitions follow a Markov decision process (MDP) represented by the transition kernel P(s_{t+1}, r_t | a_t, s_t). A key challenge in RL is the trade-off between exploration and exploitation. Exploration encourages the player to try new arms in MAB, or new actions in RL, to understand the game better; it helps to plan for the future, at the cost of potentially lowering the current reward.
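The EXP3.P update just described can be sketched as follows. This is a minimal illustration assuming rewards in [0, 1] and a Bernoulli bandit; the parameter values for γ (bias/mixing term) and α (exploration bonus) are illustrative defaults, not the tuned constants from the regret analysis.

```python
import numpy as np

def exp3p(K, T, reward_fn, gamma=0.1, alpha=2.0, seed=0):
    """Minimal EXP3.P sketch: sample with normalized weights plus a bias
    term, then exponentially update weights from importance-weighted
    reward estimates with an exploration bonus."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)                                  # trust coefficients (weights)
    total = 0.0
    for t in range(T):
        # sampling probability: normalized weight mixed with a uniform bias term
        p = (1 - gamma) * w / w.sum() + gamma / K
        a = rng.choice(K, p=p)
        r = reward_fn(a)                            # observe reward in [0, 1]
        total += r
        # importance-weighted reward estimate for the sampled arm,
        # plus a high-probability exploration bonus for every arm
        x_hat = np.zeros(K)
        x_hat[a] = r / p[a]
        bonus = alpha / (p * np.sqrt(K * T))
        w *= np.exp((gamma / (3 * K)) * (x_hat + bonus))
        w /= w.max()                                # renormalize to avoid overflow

    return total

# usage: Bernoulli bandit where arm 2 has the highest mean reward
means = [0.2, 0.4, 0.8]
bandit_rng = np.random.default_rng(1)
reward = lambda a: float(bandit_rng.random() < means[a])
total_reward = exp3p(K=3, T=2000, reward_fn=reward)
print(total_reward)
```

Dividing the weights by their maximum each round leaves the sampling probabilities unchanged while keeping the exponentials numerically stable over long horizons.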
Exploitation aims to exploit currently known states and arms to maximize the current reward, but it potentially prevents the player from gaining information that would increase future reward. To maximize the cumulative reward, the player needs to learn the game through exploration while securing current reward through exploitation. How to incentivize exploration has thus been a main focus in RL. Since RL is built on MAB, it is natural to extend MAB techniques to RL, and UCB is such a success: UCB (Auer et al., 2002a) motivates count-based exploration (Strehl and Littman, 2008) in RL and the subsequent Pseudo-Count exploration (Bellemare et al., 2016). New deep RL exploration algorithms have also been proposed recently. Using deep neural networks (Q-networks) to keep track of Q-values in RL is called DQN (Mnih et al., 2013); this combination of deep learning and RL has shown great success. ε-greedy (Mnih et al., 2015) is a simple exploration technique used with DQN. Besides ε-greedy, intrinsic-model exploration computes intrinsic rewards by focusing on experiences; added to the extrinsic (actual) rewards of RL, intrinsic rewards directly measure and incentivize exploration, e.g. DORA (Fox et al., 2018) and Stadie et al. (2015). Random Network Distillation (RND) (Burda et al., 2018) is a more recent proposal relying on a fixed target network. A drawback of RND is its local focus without global exploration.
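For contrast with the weighted-sampling schemes above, the ε-greedy rule mentioned here admits a very short sketch: with probability ε pick a uniformly random action, otherwise pick the action with the highest current Q-value. The function name and inputs are illustrative, not from the paper.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """epsilon-greedy selection over a list of Q-values: explore uniformly
    at random with probability epsilon, otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# usage: with epsilon=0 the rule is purely greedy and picks action 1
q = [0.1, 0.5, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0)
print(greedy_action)
```

Unlike EXP3.P, ε-greedy explores uniformly regardless of how promising an action has looked so far, which is part of why count-based and intrinsic-reward methods were developed.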

