ONLINE LIMITED MEMORY NEURAL-LINEAR BANDITS

Abstract

We study neural-linear bandits for solving problems where both exploration and representation learning play an important role. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is optimized during learning, information regarding exploration with "old" features is lost. We propose the first limited memory neural-linear bandit that is resilient to this catastrophic forgetting phenomenon by solving a semi-definite program. We then approximate the semi-definite program using stochastic gradient descent to make the algorithm practical and suited for online usage. We perform simulations on a variety of data sets, including regression, classification, and sentiment analysis. We observe that our algorithm achieves superior performance and shows resilience to catastrophic forgetting.

1. INTRODUCTION

Deep neural networks (DNNs) can learn representations of data with multiple levels of abstraction and have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics (LeCun et al., 2015; Goodfellow et al., 2016). Using DNNs for function approximation in reinforcement learning (RL) enables the agent to generalize across states without domain-specific knowledge and to learn rich domain representations from raw, high-dimensional inputs (Mnih et al., 2015; Silver et al., 2016). Nevertheless, how to perform efficient exploration during the representation-learning phase remains an open problem. The ε-greedy policy (Langford & Zhang, 2008) is simple to implement and widely used in practice (Mnih et al., 2015), but it is statistically suboptimal. Optimism in the Face of Uncertainty (Abbasi-Yadkori et al., 2011; Auer, 2002, OFU) and Thompson Sampling (Thompson, 1933; Agrawal & Goyal, 2013, TS) use confidence sets to balance exploitation and exploration. For DNNs, such confidence sets may not be accurate enough to allow efficient exploration. For example, using dropout as a posterior approximation for exploration does not concentrate on observed data (Osband et al., 2018) and was shown empirically to be insufficient (Riquelme et al., 2018). Alternatively, pseudo-counts, a generalization of visit counts, have been used as an exploration bonus (Bellemare et al., 2016; Pathak et al., 2017). Inspired by tabular RL, these ideas ignore the uncertainty of the value-function approximation in each context and may therefore lead to inefficient confidence sets (Osband et al., 2018). Linear models, on the other hand, are considered more stable and provide accurate uncertainty estimates, but require substantial feature engineering to achieve good results.
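To make the linear exploration mechanisms discussed above concrete, the following is a minimal sketch of Thompson Sampling for a linear contextual bandit, maintaining a Gaussian posterior over each arm's reward weights. The class name, prior scale, and noise parameter are illustrative choices of ours, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearTS:
    """Thompson Sampling for a linear contextual bandit (sketch).

    For each arm, maintains a Gaussian posterior N(mu, noise^2 * A^{-1})
    over the reward weights, where A = lam*I + X^T X and b = X^T y are the
    ridge-regression sufficient statistics of the data seen for that arm.
    """

    def __init__(self, n_arms, dim, lam=1.0, noise=1.0):
        self.noise = noise
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        """Sample weights from each arm's posterior; play the argmax arm."""
        scores = []
        for A, b in zip(self.A, self.b):
            mu = np.linalg.solve(A, b)                 # posterior mean
            cov = self.noise ** 2 * np.linalg.inv(A)   # posterior covariance
            w = rng.multivariate_normal(mu, cov)       # posterior sample
            scores.append(w @ x)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Rank-one update of the chosen arm's sufficient statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In contrast to ε-greedy, which explores uniformly at random with a fixed probability, the randomness here comes from the posterior itself, so exploration concentrates on arms whose value is still uncertain.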
Additionally, they are known to work in practice only with "medium-sized" inputs (around 1,000 features) due to numerical issues. A natural attempt at getting the best of both worlds is to learn a linear exploration policy on top of the last hidden layer of a DNN, which we term the neural-linear approach. In RL, this approach was shown to refine the performance of DQNs (Levine et al., 2017) and to improve exploration when combined with TS (Azizzadenesheli et al., 2018) and OFU (O'Donoghue et al., 2018; Zahavy et al., 2018a). For contextual bandits, Riquelme et al. (2018) showed that neural-linear TS achieves superior performance on multiple data sets. A practical challenge for neural-linear bandits is that the representation (the activations of the last hidden layer) changes after every optimization step, whereas linear contextual bandits assume the features are fixed over time. Zhou et al. (2019) recently suggested analyzing deep contextual bandits in the "infinite width" regime via the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Under the NTK assumptions, the optimal solution (and its features) is guaranteed to stay close to the initialization point, so the deep bandit can be viewed as a kernel method. Riquelme et al. (2018), on the other hand, observed that with standard DNN architectures the features do drift from the initialization point, and a mechanism to adapt to that change is required. They tackled this problem by storing the entire data set in a memory buffer and computing new features for all the data.
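The full-memory approach just described can be sketched as follows: raw contexts and rewards are kept in a buffer, and whenever the network is updated, the linear-bandit sufficient statistics are rebuilt from the new last-layer features. This is a toy NumPy sketch in the spirit of Riquelme et al. (2018); the one-hidden-layer network and the function names are our own assumptions, not the paper's implementation:

```python
import numpy as np

def features(W1, x):
    """Last-hidden-layer activations of a one-hidden-layer ReLU network,
    i.e. the 'neural-linear' representation phi(x) = relu(W1 @ x).
    (Toy stand-in for an arbitrary DNN feature extractor.)"""
    return np.maximum(W1 @ x, 0.0)

def recompute_statistics(W1, buffer, lam=1.0):
    """Full-memory fix for feature drift: after every DNN update, re-encode
    ALL stored raw contexts with the new weights W1 and rebuild the
    linear-bandit statistics A = lam*I + sum phi phi^T, b = sum r * phi."""
    d = W1.shape[0]
    A = lam * np.eye(d)
    b = np.zeros(d)
    for x, r in buffer:
        phi = features(W1, x)
        A += np.outer(phi, phi)
        b += r * phi
    return A, b
```

Because the whole data set must be stored and re-encoded after each update, memory and compute grow with the horizon, which is exactly the limitation that motivates the limited-memory approach of this paper.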

