NEURWIN: NEURAL WHITTLE INDEX NETWORK FOR RESTLESS BANDITS VIA DEEP RL

Abstract

The Whittle index policy is a powerful tool for obtaining asymptotically optimal solutions to the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains difficult for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless bandit by leveraging mathematical properties of the Whittle index. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates using deep reinforcement learning to train NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance on three recently studied restless bandit problems. Our experimental results show that the performance of NeurWIN is either better than, or as good as, that of state-of-the-art policies for all three problems.

1. INTRODUCTION

Many sequential decision problems can be modeled as multi-armed bandit problems. A bandit problem models each potential decision as an arm. In each round, we play M arms out of a total of N arms by choosing the corresponding decisions, and we then receive a reward from the played arms. The goal is to maximize the long-term total discounted reward.

Consider, for example, displaying advertisements on an online platform with the goal of maximizing the long-term discounted click-through rate. This can be modeled as a bandit problem where each arm is a piece of advertisement, and we choose which advertisements to display every time a particular user visits the platform. It should be noted that the reward, i.e., the click-through rate, of an arm is not stationary, but depends on our past actions. For example, a user who just clicked on a particular advertisement may be much less likely to click on the same advertisement in the near future. Such a problem is a classic case of the restless bandit problem, where the reward distribution of an arm depends on its state, which changes over time based on our past actions.

The restless bandit problem is notoriously intractable (Papadimitriou & Tsitsiklis, 1999). Most recent efforts, such as recovering bandits (Pike-Burke & Grunewalder, 2019), rotting bandits (Seznec et al., 2020), and Brownian bandits (Slivkins & Upfal, 2008), only study special instances of the restless bandit problem. The fundamental challenge of the restless bandit problem lies in the explosion of the state space, as the state of the entire system is the Cartesian product of the states of individual arms. A powerful tool to address this explosion is the Whittle index policy (Whittle, 1988).
In a nutshell, the Whittle index policy calculates a Whittle index for each arm based on the arm's current state, where the index loosely corresponds to the amount of cost that we are willing to pay to play the arm, and then plays the arm with the highest index. It has been shown that the Whittle index policy is either optimal or asymptotically optimal in many settings.

In this paper, we present Neural Whittle Index Network (NeurWIN), a principled machine learning approach that finds the Whittle indices for virtually all restless bandit problems. We note that the Whittle index is an artificial construct that cannot be directly measured, and finding it is typically intractable. As a result, the Whittle indices of many practical problems remain unknown except for a few special cases.

We circumvent the challenge of finding the Whittle indices by leveraging an important mathematical property of the Whittle index: Consider an alternative problem where there is only one arm and we decide whether to play the arm at each time instant. In this problem, we pay a constant cost of λ every time we play the arm. The goal is to maximize the long-term discounted net reward, defined as the difference between the rewards we obtain from the arm and the costs we pay to play it. The optimal policy is then to play the arm whenever its Whittle index is larger than λ. Based on this property, a neural network that produces the Whittle index can be viewed as one that finds the optimal policy for the alternative problem for any λ. Using this observation, we propose a deep reinforcement learning method to train NeurWIN.

To demonstrate the power of NeurWIN, we employ it for three recently studied restless bandit problems, namely, recovering bandits (Pike-Burke & Grunewalder, 2019), wireless scheduling (Aalto et al., 2015), and stochastic deadline scheduling (Yu et al., 2018).
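The two ideas above can be sketched in a few lines of code. This is an illustrative sketch only, not the paper's implementation: `whittle_policy`, `single_arm_action`, and the toy index function `toy_index` are hypothetical names, and the index function stands in for a trained network that maps an arm's state to its index.

```python
from typing import Callable, List, Sequence

def whittle_policy(index_fn: Callable[[float], float],
                   states: Sequence[float], m: int) -> List[int]:
    """Whittle index policy: play the M arms whose current states
    have the highest indices."""
    ranked = sorted(range(len(states)),
                    key=lambda i: index_fn(states[i]), reverse=True)
    return ranked[:m]

def single_arm_action(index_fn: Callable[[float], float],
                      state: float, lam: float) -> bool:
    """Optimal control for the alternative single-arm problem with
    activation cost lam: play exactly when the index exceeds lam.
    A network whose output satisfies this for every lam is, by the
    property above, a Whittle index network."""
    return index_fn(state) > lam

def toy_index(state: float) -> float:
    # Toy index that grows with the state, e.g., the time since the
    # arm was last played in a recovering bandit.
    return 2.0 * state

print(whittle_policy(toy_index, [0.1, 0.9, 0.5], m=2))   # [1, 2]
print(single_arm_action(toy_index, state=0.4, lam=0.5))  # True: 0.8 > 0.5
```

The threshold check in `single_arm_action` is what makes the property useful for training: for any fixed λ, the index network induces a control policy whose performance can be evaluated by reinforcement learning, without ever observing the index directly.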
There is no known Whittle index for the first problem, and only an approximation of the Whittle index under some relaxations exists for the second problem. Only the third problem has a precise characterization of the Whittle index. For the first two problems, the index policy using our NeurWIN achieves better performance than the policies proposed in existing studies. For the third problem, the index policy using our NeurWIN has virtually the same performance as the Whittle index policy.

The rest of the paper is organized as follows: Section 2 reviews related literature. Section 3 provides formal definitions of the Whittle index and our problem statement. Section 4 introduces our training algorithm for NeurWIN. Section 5 demonstrates the utility of NeurWIN by evaluating its performance on three recently studied restless bandit problems. Finally, Section 6 concludes the paper.

2. RELATED WORK

Restless bandit problems were first introduced in (Whittle, 1988). They are known to be intractable and are, in general, PSPACE-hard (Papadimitriou & Tsitsiklis, 1999). As a result, many studies focus on finding the Whittle index policy for restless bandit problems, such as (Le Ny et al., 2008; Meshram et al., 2018; Tripathi & Modiano, 2019; Dance & Silander, 2015). However, these studies are only able to find the Whittle indices under various specific assumptions about the bandit problems.

There have been many studies on applying RL methods to bandit problems. (Dann et al., 2017) proposed a tool called Uniform-PAC for contextual bandits. (Zanette & Brunskill, 2018) described a framework-agnostic approach towards guaranteeing RL algorithms' performance. (Jiang et al., 2017) introduced contextual decision processes (CDPs) that encompass contextual bandits for RL exploration with function approximation. (Riquelme et al., 2018) compared deep neural networks with Bayesian linear regression against other posterior sampling methods. However, none of these studies are applicable to restless bandits, where the state of an arm can change over time.

Deep RL algorithms have been utilized in problems that resemble restless bandit problems, including HVAC control (Wei et al., 2017), cyber-physical systems (Leong et al., 2020), and dynamic multichannel access (Wang et al., 2018). In all these cases, a major limitation of deep RL is scalability. As the state space grows exponentially with the number of arms, these studies can only be applied to small-scale systems, and their evaluations are limited to cases with at most 5 zones, 6 sensors, and 8 channels, respectively.

An emerging research direction is applying machine learning algorithms to learn Whittle indices. (Borkar & Chadha, 2018) proposed employing the LSPE(0) algorithm (Yu & Bertsekas, 2009) coupled with a polynomial function approximator.
This approach was applied in (Avrachenkov & Borkar, 2019) to the scheduling of web crawlers. However, it can only be applied to restless bandits whose states can be represented by a single number, and it only uses a polynomial function approximator, which may have low representational power (Sutton & Barto, 2018). (Fu et al., 2019) proposed a Q-learning-based heuristic to find Whittle indices. However, as shown in its experimental results, the heuristic may not produce the Whittle indices even when the training converges.

3. PROBLEM SETTING

In this section, we provide a brief overview of restless bandit problems and the Whittle index. We then formally define the problem statement.

