EMPIRICAL ANALYSIS OF REPRESENTATION LEARNING AND EXPLORATION IN NEURAL KERNEL BANDITS

Abstract

Neural bandits have been shown to provide an efficient solution to practical sequential decision tasks with nonlinear reward functions. The main contributor to that success is approximate Bayesian inference, which enables neural network (NN) training with uncertainty estimates. However, Bayesian NNs often suffer from prohibitive computational overhead or operate on only a subset of parameters. Alternatively, certain classes of infinite neural networks have been shown to correspond directly to Gaussian processes (GPs) with neural kernels (NKs). NK-GPs provide accurate uncertainty estimates and can be trained faster than most Bayesian NNs. We propose to guide common bandit policies with NK distributions and show that NK bandits achieve state-of-the-art performance on nonlinear structured data. Moreover, we propose a framework for independently measuring the ability of a bandit algorithm to learn representations and to explore, and use it to analyze the impact of NK distributions w.r.t. those two aspects. We consider policies based on a GP and a Student's t-process (TP). Furthermore, we study practical considerations, such as training frequency and model partitioning. We believe our work will help better understand the impact of utilizing NKs in applied settings.

1. INTRODUCTION

Contextual bandit algorithms, like upper confidence bound (UCB) (Auer et al., 2002) or Thompson sampling (TS) (Thompson, 1933), typically utilize Bayesian inference to facilitate both representation learning and uncertainty estimation. Neural networks (NNs) are increasingly applied to model nonlinear relations between contexts and rewards (Allesiardo et al., 2014; Collier & Llorens, 2018). Due to the lack of a one-size-fits-all solution to modeling uncertainty with neural networks, different NN models result in different trade-offs in the bandit framework. Riquelme et al. (2018) compared a comprehensive set of modern Bayesian approximation methods in their benchmark and observed that state-of-the-art Bayesian NNs require training times prohibitive for practical bandit applications. Classic approaches, on the other hand, lack the complexity to accurately guide a nonlinear policy. Further, the authors showed that the neural-linear method provides the best practical performance and strikes the right balance between computational efficiency and Bayesian parameter estimation. Bandit policies balance exploration and exploitation with two terms: (1) a mean reward estimate and (2) an uncertainty term (Lattimore & Szepesvári, 2020). From a Bayesian perspective, those two terms represent the first two moments of a posterior predictive distribution, e.g. a Gaussian process (GP). Research on neural kernels (NKs) has recently established a correspondence between deep networks and GPs (Lee et al., 2018). The resulting model can be trained more efficiently than most Bayesian NNs. In this work we focus on the conditions under which NK-GPs provide a competitive advantage over other NN approaches in bandit settings.
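To make the two terms above concrete, the following sketch computes a GP posterior mean and variance and combines them into a UCB score. This is a minimal illustration with a standard RBF kernel, not the NK construction discussed in this paper; the function names and hyperparameter values are ours.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential (RBF) kernel between two points."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / lengthscale ** 2)

def gp_posterior(X, y, x_star, noise=1e-2):
    """GP posterior mean and variance at a test point x_star,
    given observed contexts X and rewards y."""
    K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
    k_star = np.array([rbf(x_star, xi) for xi in X])
    A = K + noise * np.eye(len(y))
    mean = k_star @ np.linalg.solve(A, y)
    var = rbf(x_star, x_star) - k_star @ np.linalg.solve(A, k_star)
    return mean, max(var, 0.0)

def ucb_score(mean, var, beta=2.0):
    # (1) mean reward estimate plus (2) an uncertainty bonus.
    return mean + beta * np.sqrt(var)
```

As expected, the posterior variance (and hence the UCB bonus) shrinks near observed contexts and grows far from them, which is what drives exploration.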
Even though NKs have been shown to lack the full representational power of the corresponding NNs, they outperform finite fully-connected networks in small data regimes (Arora et al., 2019b) and, combined with GPs, successfully solve simple reinforcement learning tasks (Goumiri et al., 2020). NK-GPs provide a fully probabilistic treatment of infinite NNs and therefore yield more accurate predictive distributions than those used in most state-of-the-art neural bandit models. We hypothesized that NKs would outperform NNs for datasets possessing certain characteristics, such as high data complexity, a need for exploration, or a particular reward type. The full list of considered characteristics can be found in Tab. 2. We ran an empirical assessment of NK bandits using the contextual bandit benchmark proposed by Riquelme et al. (2018), whose problems are derived from balanced classification tasks with high nonlinear feature entanglement. Measuring the empirical performance of bandits may require a more detailed analysis than rendered in standard frameworks, in which the task dimensions are usually entangled. Practitioners choosing the key components of bandits (e.g. forms of predictive distributions, policies, or hyperparameters) by looking for the best performance on a single metric may miss a more nuanced view along the key dimensions of representation learning and exploration. Moreover, interactions between a policy and a predictive distribution may differ depending on the particular view. In order to provide a detailed performance assessment, we need to test these two aspects separately. For most real-world datasets this is not feasible, as we do not have access to the data-generating distribution. In order to gain insight into the capability to explore, Riquelme et al. (2018) created the "wheel dataset", which lets us measure performance under sparse reward conditions.
We propose to expand the wheel dataset with a parameter that controls the representational complexity of the task, and use it to perform an ablation study on a set of NK predictive distributions (He et al., 2020; Lee et al., 2020) with stochastic policies. We remark that our approach provides a general framework for evaluating sequential decision algorithms. Our contributions can be summarized as follows:

• We utilize recent results on the equivalence between NKs and deep neural networks to derive practical bandit algorithms.

• We show that NK bandits achieve state-of-the-art performance on complex structured data.

• We propose an empirical framework that decouples the evaluation of representation learning and exploration in sequential decision processes. Within this framework:

• We evaluate the most common NK predictive distributions and bandit policies and assess their impact on the key aspects of bandits: exploration and exploitation.

• We analyze the efficacy of a Student's t-process (TP), as proposed by Shah et al. (2013) and Tracey & Wolpert (2018), in improving exploration in NK bandits.

We make our work fully reproducible by providing the implementation of our algorithm and the experimental benchmark.foot_0 The explanation of the mathematical notation can be found in Sec. A.
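To make the wheel-dataset construction concrete, here is a sketch of a generator in the spirit of Riquelme et al. (2018): contexts are uniform on the unit disk, a "safe" arm pays a medium reward everywhere, and outside an exploration radius δ one of the remaining arms (chosen by quadrant) pays a large reward. The reward constants (1.0, 1.2, 50.0) follow the values commonly cited for this dataset; the function name and interface are ours, not the paper's.

```python
import numpy as np

def sample_wheel(n, delta, mu_low=1.0, mu_mid=1.2, mu_high=50.0,
                 sigma=0.01, rng=None):
    """Sample n contexts and 5-arm rewards for the wheel bandit.

    Inside radius `delta`, only arm 0 pays the medium reward; outside,
    one of arms 1-4 (chosen by quadrant) pays the high reward, so
    `delta` controls how much exploration the task demands.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Rejection-sample uniform points on the unit disk.
    X = rng.uniform(-1, 1, size=(4 * n, 2))
    X = X[np.linalg.norm(X, axis=1) <= 1][:n]
    means = np.full((n, 5), mu_low)
    means[:, 0] = mu_mid                          # arm 0 is the safe arm
    outside = np.linalg.norm(X, axis=1) > delta
    quadrant = 2 * (X[:, 0] > 0) + (X[:, 1] > 0)  # integers 0..3
    means[outside, 1 + quadrant[outside]] = mu_high
    rewards = means + sigma * rng.standard_normal(means.shape)
    return X, rewards
```

A complexity-controlling extension in the spirit of this paper could, for instance, replace the quadrant partition with a finer angular partition, but the exact construction is defined in the main text, not here.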

2. BACKGROUND

In this section we provide the necessary background on NK bandits from the perspective of Bayesian inference. Our approach focuses heavily on the posterior predictive derivation and the Bayesian interpretation of predictive distributions obtained as ensembles of infinite-width deep neural networks. We build on material related to neural bandits, NKs, and the posterior predictive. We also point out the unique way in which these methods are combined and utilized in our approach.

2.1. CONTEXTUAL BANDIT POLICIES

Contextual multi-armed bandits are probabilistic models that, at each round t ∈ [T] of a sequential decision process, receive a set of k arms and a global context x_t. The role of a policy π is to select an action a ∈ [k] and observe the associated reward y_{t,a}. We name the triplet (x_t, a_t, y_{t,a_t}) an observation at time t. Observations are stored in a dataset, which can be joint (D) or kept separate for each arm (D_a). The objective is to minimize the cumulative regret R_T = E[ Σ_{t=1}^{T} max_a y_{t,a} − Σ_{t=1}^{T} y_{t,a_t} ] or, alternatively, to maximize the cumulative reward Σ_{t=1}^{T} y_{t,a_t}.

foot_0: Code link temporarily hidden for blind review.

Thompson sampling (TS) is a policy that operates on the Bayesian principle. The reward estimates for each arm are computed in terms of a posterior predictive distribution p(y* | a, D_a) = ∫ p(y* | θ) p(θ | D_a) dθ. Linear, kernel, and neural TS (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017; Riquelme et al., 2018) are all contextual variants of TS, which commonly model
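As an illustration of the TS principle described above, the sketch below draws one sample from a closed-form Gaussian posterior per arm and plays the argmax. This is a non-contextual simplification for illustration only; the per-arm Normal prior, the noise variance, and all names are our assumptions, not the paper's model.

```python
import numpy as np

def thompson_step(counts, sums, sigma2=0.01, prior_var=1.0, rng=None):
    """One Thompson-sampling step with an independent Gaussian
    posterior over each arm's mean reward.

    counts[a], sums[a]: number of pulls and summed rewards of arm a.
    With a N(0, prior_var) prior and N(mean, sigma2) reward noise,
    the posterior over each arm's mean is Gaussian in closed form.
    """
    if rng is None:
        rng = np.random.default_rng()
    post_var = 1.0 / (1.0 / prior_var + counts / sigma2)
    post_mean = post_var * sums / sigma2
    # Sample one plausible mean per arm, then act greedily on the sample.
    samples = rng.normal(post_mean, np.sqrt(post_var))
    return int(np.argmax(samples))
```

Contextual variants replace the per-arm posterior with a posterior conditioned on x_t, e.g. a linear, kernel, or NK-GP model as in the references above.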

