EMPIRICAL ANALYSIS OF REPRESENTATION LEARNING AND EXPLORATION IN NEURAL KERNEL BANDITS

Abstract

Neural bandits have been shown to provide an efficient solution to practical sequential decision tasks that have nonlinear reward functions. The main contributor to that success is approximate Bayesian inference, which enables neural network (NN) training with uncertainty estimates. However, Bayesian NNs often suffer from prohibitive computational overhead or operate on only a subset of parameters. Alternatively, certain classes of infinite neural networks have been shown to correspond directly to Gaussian processes (GPs) with neural kernels (NKs). NK-GPs provide accurate uncertainty estimates and can be trained faster than most Bayesian NNs. We propose to guide common bandit policies with NK distributions and show that NK bandits achieve state-of-the-art performance on nonlinear structured data. Moreover, we propose a framework for independently measuring a bandit algorithm's ability to learn representations and to explore, and use it to analyze the impact of NK distributions w.r.t. those two aspects. We consider policies based on a GP and a Student's t-process (TP). Furthermore, we study practical considerations, such as training frequency and model partitioning. We believe our work will help better understand the impact of utilizing NKs in applied settings.

1. INTRODUCTION

Contextual bandit algorithms, like upper confidence bound (UCB) (Auer et al., 2002) or Thompson sampling (TS) (Thompson, 1933), typically utilize Bayesian inference to facilitate both representation learning and uncertainty estimation. Neural networks (NNs) are increasingly applied to model nonlinear relations between contexts and rewards (Allesiardo et al., 2014; Collier & Llorens, 2018). Due to the lack of a one-size-fits-all solution for modeling uncertainty with neural networks, different NN models entail different trade-offs in the bandit framework. Riquelme et al. (2018) compared a comprehensive set of modern Bayesian approximation methods in their benchmark and observed that state-of-the-art Bayesian NNs require training times prohibitive for practical bandit applications. Classic approaches, on the other hand, lack the complexity to accurately guide a nonlinear policy. Further, the authors showed that the neural-linear method provides the best practical performance, striking the right balance between computational efficiency and Bayesian parameter estimation. Bandit policies balance exploration and exploitation with two terms: (1) a mean reward estimate and (2) an uncertainty term (Lattimore & Szepesvári, 2020). From a Bayesian perspective, those two terms represent the first two moments of a posterior predictive distribution, e.g. of a Gaussian process (GP). Research on neural kernels (NKs) has recently established a correspondence between deep networks and GPs (Lee et al., 2018). The resulting model can be trained more efficiently than most Bayesian NNs. In this work we focus on the conditions under which NK-GPs provide a competitive advantage over other NN approaches in bandit settings.
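To make the two-term view concrete, the following sketch shows how UCB and TS select an arm given only the first two posterior predictive moments per arm. The moment vectors `mu` and `sigma` are illustrative placeholders (in the paper's setting they would come from an NK-GP posterior, not computed here), and `beta` is a hypothetical exploration coefficient.

```python
import numpy as np

def ucb_action(mu, sigma, beta=2.0):
    """UCB: pick the arm maximizing mean + beta * uncertainty."""
    return int(np.argmax(mu + beta * sigma))

def ts_action(mu, sigma, rng):
    """TS: sample one reward per arm from the posterior, pick the max."""
    return int(np.argmax(rng.normal(mu, sigma)))

# Illustrative posterior moments for three arms.
mu = np.array([0.50, 0.40, 0.10])
sigma = np.array([0.05, 0.30, 0.02])

print(ucb_action(mu, sigma))  # prefers arm 1: lower mean, high uncertainty
print(ucb_action(mu, sigma, beta=0.0))  # pure exploitation: arm 0
```

Both policies reduce exploration as posterior uncertainty shrinks, which is why the quality of the predictive distribution (the focus of this work) directly shapes the exploration behavior.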
Even though NKs have been shown to lack the full representational power of the corresponding NNs, they outperform finite fully-connected networks in small data regimes (Arora et al., 2019b) and, combined with GPs, successfully solve simple reinforcement learning tasks (Goumiri et al., 2020). NK-GPs provide a fully probabilistic treatment of infinite NNs and therefore yield more accurate predictive distributions than those used in most state-of-the-art neural bandit models. We hypothesized that NKs would outperform NNs for datasets possessing certain characteristics, such as data complexity, the need for exploration, or reward type. The full list of considered characteristics can be found in Tab. 2. We ran an empirical assessment of NK bandits using the contextual bandit benchmark proposed by Riquelme et al. (2018).

