AN EMPIRICAL STUDY OF NEURAL CONTEXTUAL BANDIT ALGORITHMS

Abstract

Recent advances in representation learning have significantly influenced solutions to contextual bandit problems. Neural bandit algorithms have been actively developed and, in numerous papers, reported to achieve extraordinary performance improvements over classical bandit algorithms. However, a comprehensive comparison among existing neural bandit algorithms is still lacking, and it remains unclear whether or when they succeed in complex real-world problems. In this work, we present an inclusive empirical study of three different categories of existing neural bandit algorithms on several real-world datasets. The results show that such algorithms are highly competitive against their classical counterparts in most cases; however, the advantage is not consistent. The results also reveal crucial challenges for future research in neural bandit algorithms.

1. INTRODUCTION

In recent decades, contextual bandit algorithms have been extensively studied (Langford & Zhang, 2007; Chu et al., 2011) for solving sequential decision-making problems. In such problems, an agent iteratively interacts with the environment to maximize its accumulated reward over time based on the given context. The essence of contextual bandits is to balance exploration and exploitation under uncertainty. In practice, contextual bandit algorithms have wide applications in real-world scenarios, including content recommendation (Li et al., 2010; Wu et al., 2016), online advertising (Schwartz et al., 2017; Nuara et al., 2018), and mobile health (Lei et al., 2017; Tewari & Murphy, 2017). Linear contextual bandits, which assume the expected reward is linearly related to the given context features, have been extensively studied in the literature (Auer et al., 2002; Rusmevichientong & Tsitsiklis, 2010; Dani et al., 2008; Abbasi-Yadkori et al., 2011; Chu et al., 2011). Though linear contextual bandit algorithms are theoretically sound and succeed in a number of real-world applications, the linear assumption fails to capture non-linear relations between the context vector and the reward. This motivates the study of generalized linear bandits (Li et al., 2017; Faury et al., 2020; Filippi et al., 2010) and kernelized bandits (Krause & Ong, 2011; Chowdhury & Gopalan, 2017; Valko et al., 2013). Recently, deep neural networks (DNNs) (LeCun et al., 2015) have been introduced to learn the underlying reward mapping directly. Riquelme et al. (2018) developed NeuralLinear, which applies Bayesian linear regression to the feature mappings learned by the last layer of a neural network and obtains the reward approximation via Thompson Sampling. Zahavy & Mannor (2019) extended NeuralLinear with a likelihood matching mechanism to overcome the catastrophic forgetting problem.
Xu et al. (2020) proposed Neural-LinUCB, which performs exploration over the last layer of the neural network. NeuralUCB (Zhou et al., 2020), NeuralTS (Zhang et al., 2020), and NPR (Jia et al., 2021) explore the entire neural network parameter space to obtain nearly optimal regret using the neural tangent kernel technique (Jacot et al., 2018). All of these neural contextual bandit algorithms reported encouraging empirical improvements over their classical counterparts or a selected subset of neural contextual bandit algorithms. However, a horizontal comparison among the neural contextual bandit solutions on more comprehensive real-world datasets is still missing. We argue that, for practical applications, it is important to understand when and how a neural contextual bandit algorithm better suits a specific task. In this work, we provide an extensive empirical evaluation of the most frequently cited neural contextual bandit algorithms on nine real-world datasets: six K-class classification datasets from the UCI machine learning repository (Dua & Graff, 2017), one learning-to-rank dataset for web search, and two logged bandit datasets for online recommendation. We choose LinUCB as the reference linear bandit algorithm against six selected neural contextual bandit algorithms: NeuralLinear, NeuralLinear-LikelihoodMatching, NeuralUCB, Neural-LinUCB, NeuralTS, and NPR. We evaluate all bandit algorithms under the metrics of regret/reward and running time, as well as model sensitivity to the choice of neural network architecture and hyper-parameter settings. We conclude that in most cases, neural contextual bandit algorithms provide significant performance improvements over the linear model, while in some specific cases, the advantage of neural bandits is marginal. Moreover, the results demonstrate that across different datasets and problem settings, different neural contextual bandit algorithms show varying patterns.
In other words, no single neural bandit algorithm outperforms others in every bandit problem.

2. ALGORITHMS

In this section, we first introduce the general setting of the contextual bandit problem, and then present existing bandit solutions, including both linear and neural models.

2.1. CONTEXTUAL BANDIT PROBLEM

We focus on the problem of contextual bandits, where the agent iteratively interacts with the environment for $T$ rounds; $T$ is known beforehand. At each round, the agent chooses one arm from $K$ candidate arms, where each arm $a$ is associated with a $d$-dimensional context vector $x_a \in \mathbb{R}^d$. Once arm $a_t$ is selected, the agent receives the corresponding reward $r_{t,a_t}$, generated as $r_{t,a_t} = h(x_{t,a_t}) + \eta_t$, where $h$ is an unknown reward mapping and $\eta_t$ is $\nu$-sub-Gaussian noise. The goal of a bandit algorithm is to minimize the pseudo-regret $R_T = \mathbb{E}\big[\sum_{t=1}^{T} (r_{t,a_t^*} - r_{t,a_t})\big]$, where $a_t^*$ is the optimal arm at round $t$, i.e., the arm with the maximum expected reward.
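The interaction protocol above can be sketched as a simple simulation loop. This is an illustrative sketch only: the linear ground-truth weights `theta_star`, the Gaussian context generator, and the uniform-random placeholder policy are all hypothetical stand-ins for an environment and an actual bandit algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, d = 1000, 5, 10              # horizon, number of arms, context dimension
theta_star = rng.normal(size=d)    # hypothetical ground-truth reward weights

regret = 0.0
for t in range(T):
    contexts = rng.normal(size=(K, d))            # one context vector per arm
    expected = contexts @ theta_star              # h(x_{t,a}) for each arm (linear here)
    a_t = rng.integers(K)                         # placeholder policy: pick uniformly at random
    r_t = expected[a_t] + rng.normal(scale=0.1)   # observed reward r = h(x) + noise
    regret += expected.max() - expected[a_t]      # pseudo-regret accumulates the expected gap
```

A real bandit algorithm replaces the random arm choice with a policy that uses `contexts` and past observations; the pseudo-regret then measures how quickly that policy homes in on the best arm per round.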

2.2. LINEAR CONTEXTUAL BANDIT ALGORITHMS

In linear contextual bandits, the unknown reward function $h(\cdot)$ is assumed to be linear: $h(x_{t,a_t}) = x_{t,a_t}^\top \theta^*$, where $\theta^* \in \mathbb{R}^d$ is the underlying unknown model weight. One of the most popular linear contextual bandit algorithms is LinUCB (Li et al., 2010; Abbasi-Yadkori et al., 2011). At each round $t$, ridge regression is applied to learn the current model $\theta_t$ based on the observations collected so far,
$$\theta_t = \arg\min_{\theta} \sum_{\tau=1}^{t-1} \big(r_{\tau,a_\tau} - x_{\tau,a_\tau}^\top \theta\big)^2 + \frac{\lambda}{2} \|\theta\|_2^2, \qquad (2.2)$$
where $\lambda$ is the coefficient of the $L_2$ regularization. Then, LinUCB pulls the arm with the highest upper confidence bound:
$$a_t = \arg\max_{a \in [K]} \; x_{t,a}^\top \theta_t + \alpha_t \sqrt{x_{t,a}^\top A_t^{-1} x_{t,a}}, \qquad A_t = \lambda I + \sum_{\tau=1}^{t-1} x_{\tau,a_\tau} x_{\tau,a_\tau}^\top,$$
where $\alpha_t > 0$ is a scaling factor that controls the exploration rate. Once the reward of the pulled arm is received, the model is updated to $\theta_{t+1}$. By leveraging the width of the confidence interval of the reward estimate, LinUCB balances the explore-exploit trade-off in bandit learning and achieves a sublinear regret with respect to the time horizon $T$.
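The update and selection rules above admit a compact closed-form implementation, since the ridge-regression solution is $\theta_t = A_t^{-1} b_t$ with $b_t = \sum_\tau r_{\tau,a_\tau} x_{\tau,a_\tau}$. The sketch below assumes a fixed exploration scale `alpha` in place of the round-dependent $\alpha_t$, and is not the paper's exact implementation.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: ridge regression in closed form plus a UCB
    exploration bonus. `alpha` is a fixed stand-in for alpha_t."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = lam * np.eye(d)   # A_t = lam * I + sum_tau x x^T
        self.b = np.zeros(d)       # b_t = sum_tau r * x

    def select(self, contexts):
        """contexts: (K, d) array of per-arm context vectors.
        Returns the index of the arm with the highest UCB score."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b     # closed-form ridge estimate theta_t
        # x^T theta + alpha * sqrt(x^T A^{-1} x) for every arm at once
        bonus = np.sqrt(np.einsum('kd,de,ke->k', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, x, r):
        """Incorporate the observed (context, reward) pair."""
        self.A += np.outer(x, x)
        self.b += r * x
```

In practice one would maintain `A_inv` incrementally (e.g., via the Sherman-Morrison formula) rather than re-inverting $A_t$ each round; the direct inverse is kept here for clarity.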

2.3. NEURAL BANDIT ALGORITHMS

Numerous attempts have been made to apply neural networks to contextual bandit problems, motivated by the fact that neural networks are universal approximators of unknown functions (Cybenko, 1989). In the following sections, we categorize existing neural contextual bandit algorithms into three main categories based on their exploration methods.

