ONLINE LIMITED MEMORY NEURAL-LINEAR BANDITS

Abstract

We study neural-linear bandits for solving problems where both exploration and representation learning play an important role. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is optimized during learning, information regarding exploration with "old" features is lost. We propose the first limited-memory neural-linear bandit that is resilient to this catastrophic forgetting phenomenon by solving a semi-definite program. We then approximate the semi-definite program with stochastic gradient descent to make the algorithm practical and suited for online usage. We perform simulations on a variety of data sets, including regression, classification, and sentiment analysis, and observe that our algorithm achieves superior performance and shows resilience to catastrophic forgetting.

1. INTRODUCTION

Deep neural networks (DNNs) can learn representations of data with multiple levels of abstraction and have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics (LeCun et al., 2015; Goodfellow et al., 2016). Using DNNs for function approximation in reinforcement learning (RL) enables the agent to generalize across states without domain-specific knowledge and to learn rich domain representations from raw, high-dimensional inputs (Mnih et al., 2015; Silver et al., 2016). Nevertheless, the question of how to perform efficient exploration during the representation learning phase remains open. The ε-greedy policy (Langford & Zhang, 2008) is simple to implement and widely used in practice (Mnih et al., 2015), but it is statistically suboptimal. Optimism in the Face of Uncertainty (Abbasi-Yadkori et al., 2011; Auer, 2002, OFU) and Thompson Sampling (Thompson, 1933; Agrawal & Goyal, 2013, TS) use confidence sets to balance exploitation and exploration. For DNNs, such confidence sets may not be accurate enough to allow efficient exploration. For example, using dropout as a posterior approximation for exploration does not concentrate on observed data (Osband et al., 2018) and was shown empirically to be insufficient (Riquelme et al., 2018). Alternatively, pseudo-counts, a generalization of the number of visits, have been used as an exploration bonus (Bellemare et al., 2016; Pathak et al., 2017). Inspired by tabular RL, these ideas ignore the uncertainty in the value-function approximation in each context and, as a result, may lead to inefficient confidence sets (Osband et al., 2018). Linear models, on the other hand, are considered more stable and provide accurate uncertainty estimates, but they require substantial feature engineering to achieve good results.
Additionally, they are known to work in practice only with "medium-sized" inputs (around 1,000 features) due to numerical issues. A natural attempt at getting the best of both worlds is to learn a linear exploration policy on top of the last hidden layer of a DNN, which we term the neural-linear approach. In RL, this approach was shown to refine the performance of DQNs (Levine et al., 2017) and to improve exploration when combined with TS (Azizzadenesheli et al., 2018) and OFU (O'Donoghue et al., 2018; Zahavy et al., 2018a). For contextual bandits, Riquelme et al. (2018) showed that neural-linear TS achieves superior performance on multiple data sets. A practical challenge for neural-linear bandits is that the representation (the activations of the last hidden layer) changes after every optimization step, whereas linear contextual bandits assume the features are fixed over time. Zhou et al. (2019) recently suggested analyzing deep contextual bandits of "infinite width" via the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Under the NTK assumptions, the optimal solution (and its features) is guaranteed to be close to the initialization point, so the deep bandit can be viewed as a kernel method. Riquelme et al. (2018), on the other hand, observed that with standard DNN architectures the features do change from the initialization point, and a mechanism to adapt to that change is required. They tackled this problem by storing the entire data set in a memory buffer and computing new features for all the data after each DNN learning phase. The authors also experimented with a bounded memory buffer but observed a significant decrease in performance due to catastrophic forgetting (Kirkpatrick et al., 2017), i.e., a loss of information from previous experience. In this work, we propose a neural-linear bandit that uses TS on top of the last layer of a DNN.
Key to our approach is a novel method for computing priors whenever the DNN features change, which makes our algorithm resilient to catastrophic forgetting. Specifically, we adjust the moments of the likelihood of the reward estimate conditioned on the new features to match the likelihood conditioned on the old features. We achieve this by solving a semi-definite program (Vandenberghe & Boyd, 1996, SDP) to approximate the covariance and using the weights of the last layer as a prior for the mean. To make the algorithm more appealing for real-time usage, we implement it in an online manner, in which the DNN weights and the priors are updated simultaneously at every step using stochastic gradient descent (SGD) followed by a projection of the priors. This obviates the need to process the whole memory buffer after each DNN learning phase and keeps the computational burden of our algorithm small. We performed experiments on several real-world and simulated data sets, including classification and regression, using Multi-Layered Perceptrons (MLPs). These experiments suggest that our prior approximation scheme improves performance significantly when memory is limited. We also demonstrate that our neural-linear bandit performs well on a sentiment analysis data set where the input is given in natural language (with 8k features) and we use a Convolutional Neural Network (CNN); in this regime, it is not feasible to use a linear method due to computational problems. In addition, we evaluate our algorithm in a stochastic simulation of an uplink video-transmission application, in which the simulation is so long that it is not possible to use the unlimited memory neural-linear approach of Riquelme et al. (2018). To the best of our knowledge, this is the first neural-linear algorithm that is resilient to catastrophic forgetting due to limited memory. In addition, unlike Riquelme et al.
(2018), who use a batch-based approach, our algorithm can be configured to work in an online manner, in which the DNN and the statistics are efficiently updated at every step. Thus, ours is also the first online neural-linear algorithm.
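To make the likelihood-matching idea concrete, the following is a rough illustrative sketch, not our exact semi-definite program: treat the prior covariance S (the inverse of the precision prior) as a free matrix, and run SGD so that predictive variances x^T S x under the new features match target variances computed with the old features, projecting onto the PSD cone after each step. The function name, interface, and step sizes below are our own illustrative choices.

```python
import numpy as np

def match_prior_covariance(X_new, target_var, steps=3000, lr=2e-3, seed=0):
    """Illustrative sketch of likelihood matching via SGD + PSD projection.

    X_new: (n, d) new features for the buffered contexts.
    target_var: (n,) predictive variances computed with the old features.
    Returns a PSD covariance S; the precision prior is then inv(S).
    """
    rng = np.random.default_rng(seed)
    n, d = X_new.shape
    S = np.eye(d)
    for _ in range(steps):
        i = rng.integers(n)                        # sample one buffered context
        x, v = X_new[i], target_var[i]
        resid = x @ S @ x - v                      # variance mismatch
        S -= lr * 2.0 * resid * np.outer(x, x)     # SGD step on squared mismatch
        # Projection onto the PSD cone: clip negative eigenvalues
        w, U = np.linalg.eigh(S)
        S = (U * np.clip(w, 1e-6, None)) @ U.T
    return S
```

Because x^T S x is linear in the entries of S, each step is a stochastic least-squares update, and the eigenvalue clipping keeps the iterate a valid covariance throughout.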

2. BACKGROUND

The stochastic, contextual (linear) multi-armed bandit problem. At every time t, a contextual bandit algorithm observes a context b(t) and chooses an arm a(t) ∈ {1, . . . , N}. The bandit can use the history H_{t-1} = {b(τ), a(τ), r_{a(τ)}(τ) : τ = 1, . . . , t-1} to make its decisions, where a(τ) denotes the arm played at time τ. Most existing works make the following realizability assumption (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013).

Assumption 1. The reward for arm i at time t is generated from an (unknown) distribution such that E[r_i(t) | b(t), H_{t-1}] = E[r_i(t) | b(t)] = b(t)^T µ_i, where {µ_i ∈ R^d}_{i=1}^N are fixed but unknown.

Let a*(t) denote the optimal arm at time t, i.e., a*(t) = argmax_i b(t)^T µ_i, and let ∆_i(t) denote the gap between the mean rewards of the optimal arm and of arm i at time t, i.e., ∆_i(t) = b(t)^T µ_{a*(t)} - b(t)^T µ_i. The objective is to minimize the total regret R(T) = Σ_{t=1}^T ∆_{a(t)}(t), where T is finite.

Algorithm 1 TS for linear contextual bandits
  ∀i ∈ [1, . . . , N], set Φ_i = I_d, µ̂_i = 0_d, f_i = 0_d
  for t = 1, 2, . . . do
    ∀i ∈ [1, . . . , N], sample µ̃_i from N(µ̂_i, v^2 Φ_i^{-1})
    Play arm a(t) := argmax_i b(t)^T µ̃_i
    Observe reward r_t
    Update: Φ_{a(t)} ← Φ_{a(t)} + b(t) b(t)^T, f_{a(t)} ← f_{a(t)} + b(t) r_t, µ̂_{a(t)} ← Φ_{a(t)}^{-1} f_{a(t)}
  end for

TS for linear contextual bandits. Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance (Russo et al., 2018; Lattimore & Szepesvári, 2018). For linear contextual bandits, TS was introduced in Agrawal & Goyal (2013, Alg. 1).
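Algorithm 1 can be sketched in NumPy as follows; the environment interface (`contexts`, `rewards_fn`) is a placeholder of our own, not part of the algorithm.

```python
import numpy as np

def linear_ts(contexts, rewards_fn, n_arms, d, v=1.0, T=1000, seed=0):
    """Thompson sampling for linear contextual bandits (Algorithm 1).

    contexts(t) returns the context b(t) in R^d; rewards_fn(t, arm) returns
    the observed reward. Both are illustrative environment hooks.
    """
    rng = np.random.default_rng(seed)
    Phi = [np.eye(d) for _ in range(n_arms)]       # precision matrices Phi_i
    f = [np.zeros(d) for _ in range(n_arms)]       # running sums of b(t) r_t
    mu_hat = [np.zeros(d) for _ in range(n_arms)]  # posterior means
    total_reward = 0.0
    for t in range(T):
        b = contexts(t)
        # Sample mu_tilde_i ~ N(mu_hat_i, v^2 Phi_i^{-1}) for every arm
        samples = [rng.multivariate_normal(mu_hat[i],
                                           v**2 * np.linalg.inv(Phi[i]))
                   for i in range(n_arms)]
        a = int(np.argmax([b @ s for s in samples]))
        r = rewards_fn(t, a)
        # Rank-one posterior update for the played arm only
        Phi[a] += np.outer(b, b)
        f[a] += b * r
        mu_hat[a] = np.linalg.solve(Phi[a], f[a])
        total_reward += r
    return total_reward, mu_hat
```

On a noise-free toy problem with a fixed context, the sampling noise shrinks for whichever arm is played, so play concentrates on the better arm.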
Suppose that the likelihood of the reward r_i(t), given context b(t) and parameter µ_i, is given by the pdf of the Gaussian distribution N(b(t)^T µ_i, ν^2), and let

Φ_i(t) = Φ_i^0 + Σ_{τ=1}^{t-1} b(τ) b(τ)^T 1_{i=a(τ)},   µ̂_i(t) = Φ_i^{-1}(t) Σ_{τ=1}^{t-1} b(τ) r_{a(τ)}(τ) 1_{i=a(τ)},

where 1 is the indicator function and Φ_i^0 is the precision prior. Given a Gaussian prior for arm i at time t, N(µ̂_i(t), v^2 Φ_i^{-1}(t)), the posterior distribution at time t + 1 is given by

Pr(µ̃_i | r_i(t)) ∝ Pr(r_i(t) | µ̃_i) Pr(µ̃_i) ∝ N(µ̂_i(t + 1), v^2 Φ_i^{-1}(t + 1)).
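As a sanity check on these expressions, the closed-form quantities Φ_i(t) and µ̂_i(t) above coincide with the sequential rank-one updates of Algorithm 1; the synthetic single-arm stream below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 50
B = rng.normal(size=(T, d))   # contexts b(tau) played on a single arm
r = rng.normal(size=T)        # corresponding observed rewards
Phi0 = np.eye(d)              # precision prior Phi_i^0

# Batch form: Phi_i(t) = Phi0 + sum_tau b b^T,  mu_i(t) = Phi_i(t)^{-1} sum_tau b r
Phi_batch = Phi0 + B.T @ B
mu_batch = np.linalg.solve(Phi_batch, B.T @ r)

# Sequential rank-one updates, as in Algorithm 1
Phi, f = Phi0.copy(), np.zeros(d)
for b, rew in zip(B, r):
    Phi += np.outer(b, b)
    f += b * rew
mu_seq = np.linalg.solve(Phi, f)

assert np.allclose(Phi, Phi_batch) and np.allclose(mu_seq, mu_batch)
```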

