AN EMPIRICAL STUDY OF NEURAL CONTEXTUAL BANDIT ALGORITHMS

Abstract

Recent advances in representation learning have significantly influenced solutions to contextual bandit problems. Neural bandit algorithms have been actively developed and, in numerous papers, reported to achieve substantial performance improvements over classical bandit algorithms. However, a comprehensive comparison among the existing neural bandit algorithms is still lacking, and it remains unclear whether or when they can succeed in complex real-world problems. In this work, we present an inclusive empirical study of three different categories of existing neural bandit algorithms on several real-world datasets. The results show that such algorithms are highly competitive against their classical counterparts in most cases, but the advantage is not consistent. The results also reveal crucial challenges for future research on neural bandit algorithms.

1. INTRODUCTION

In recent decades, contextual bandit algorithms have been extensively studied (Langford & Zhang, 2007; Chu et al., 2011) for solving sequential decision-making problems. In such problems, an agent iteratively interacts with the environment to maximize its accumulated reward over time based on the given context. The essence of contextual bandits is to balance exploration and exploitation under uncertainty. In practice, contextual bandit algorithms have wide applications in real-world scenarios, including content recommendation (Li et al., 2010; Wu et al., 2016), online advertising (Schwartz et al., 2017; Nuara et al., 2018), and mobile health (Lei et al., 2017; Tewari & Murphy, 2017). Linear contextual bandits, which assume the expected reward is a linear function of the given context features, have been extensively studied in the literature (Auer et al., 2002; Rusmevichientong & Tsitsiklis, 2010; Dani et al., 2008; Abbasi-Yadkori et al., 2011; Chu et al., 2011). Though linear contextual bandit algorithms are theoretically sound and have succeeded in a number of real-world applications, the linearity assumption fails to capture non-linear relations between the context vector and the reward. This motivates the study of generalized linear bandits (Li et al., 2017; Faury et al., 2020; Filippi et al., 2010) and kernelized bandits (Krause & Ong, 2011; Chowdhury & Gopalan, 2017; Valko et al., 2013).

Recently, deep neural networks (DNNs) (LeCun et al., 2015) have been introduced to learn the underlying reward mapping directly. Riquelme et al. (2018) developed NeuralLinear, which applies Bayesian linear regression to the feature mapping learned by the last layer of a neural network and approximates the reward via Thompson Sampling. Zahavy & Mannor (2019) extended NeuralLinear with a likelihood matching mechanism to overcome the catastrophic forgetting problem. Xu et al. (2020) proposed Neural-LinUCB, which performs exploration only over the last layer of the neural network. NeuralUCB (Zhou et al., 2020), NeuralTS (Zhang et al., 2020) and NPR (Jia et al., 2021) explore the entire neural network parameter space to obtain near-optimal regret using the neural tangent kernel technique (Jacot et al., 2018). All of these neural contextual bandit algorithms reported encouraging empirical improvements over their classical counterparts or over a selected subset of other neural contextual bandit algorithms. However, a horizontal comparison among neural contextual bandit solutions on more comprehensive real-world datasets is still lacking. We argue that, for practical applications, it is important to understand when and how a neural contextual bandit algorithm better suits a specific task.

In this work, we provide an extensive empirical evaluation of the most frequently referenced neural contextual bandit algorithms on nine real-world datasets: six K-class classification datasets from the UCI machine learning repository (Dua & Graff, 2017), one learning-to-rank dataset for web search, and two
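To make the NeuralLinear recipe concrete, the following is a minimal sketch of its decision loop: Bayesian linear regression on last-layer features, with arms chosen by Thompson Sampling. This is an illustrative assumption, not the authors' reference implementation; in particular, the feature map `phi`, the dimensions, and the per-arm posterior bookkeeping are stand-ins (a real agent would use and periodically retrain an actual network's penultimate layer rather than the fixed random projection used here).

```python
import numpy as np

rng = np.random.default_rng(0)

d_ctx, d_feat, n_arms = 8, 16, 4
W = rng.normal(size=(d_feat, d_ctx))  # stand-in for trained network weights

def phi(x):
    """Last-layer feature map (frozen here purely for illustration)."""
    return np.tanh(W @ x)

# Per-arm Bayesian linear regression posterior over reward weights:
# theta_a ~ N(A_a^{-1} b_a, A_a^{-1}), with a unit-variance Gaussian
# likelihood and standard-normal prior.
A = [np.eye(d_feat) for _ in range(n_arms)]    # precision matrices
b = [np.zeros(d_feat) for _ in range(n_arms)]  # reward-weighted features

def select_arm(x):
    """Thompson Sampling: sample weights per arm, play the best sample."""
    z = phi(x)
    scores = []
    for a in range(n_arms):
        cov = np.linalg.inv(A[a])
        theta = rng.multivariate_normal(cov @ b[a], cov)  # posterior draw
        scores.append(theta @ z)
    return int(np.argmax(scores))

def update(x, arm, reward):
    """Rank-one posterior update for the played arm only."""
    z = phi(x)
    A[arm] += np.outer(z, z)
    b[arm] += reward * z

# One simulated round: observe context, choose an arm, observe reward.
x = rng.normal(size=d_ctx)
arm = select_arm(x)
update(x, arm, reward=1.0)
```

The key design point this sketch highlights is why NeuralLinear is cheap: uncertainty is maintained only over the `d_feat`-dimensional last layer, so exploration reduces to classical linear Thompson Sampling on learned features.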

