REINFORCEMENT LEARNING FOR BANDITS WITH CONTINUOUS ACTIONS AND LARGE CONTEXT SPACES

Abstract

We consider the challenging scenario of contextual bandits with continuous actions and large context spaces, e.g. images. We posit that by modifying reinforcement learning (RL) algorithms for continuous control, we can outperform handcrafted contextual bandit algorithms for continuous actions on standard benchmark datasets with vector contexts. We demonstrate that parametric policy networks outperform recently published tree-based policies (Majzoubi et al., 2020) in both average regret and cost on held-out samples. Furthermore, we demonstrate that RL algorithms generalise contextual bandit problems with continuous actions to large context spaces, providing state-of-the-art results on image contexts. Lastly, we introduce a new contextual bandits domain with a multidimensional continuous action space and image contexts, which existing methods cannot handle.

1. INTRODUCTION

We consider the challenging scenario of contextual bandits with continuous actions and large "context" spaces, e.g. images. This setting arises naturally when an agent is repeatedly asked to provide a single continuous action after observing a context only once. In each round, the agent receives a context, chooses an action from a continuous action space, and obtains an immediate reward based on an unknown loss function. The process then repeats with a new context. The agent's goal is to learn to act optimally, usually formalised as minimising regret over the actions selected across a fixed number of trials.

Our work is motivated by an increasingly important application in personalised healthcare: an agent must make dosing decisions based on a patient's single 3D image scan, since additional scans after treatment can potentially be damaging to the patient (Jarrett et al., 2019). This is an unsolved domain; current methods fail to handle the large context space associated with 3D scans and the high-dimensional actions required.

We consider this problem under the contextual bandits framework. Contextual bandits model single-step decision making under uncertainty, where both exploration and exploitation are required in unknown environments. They pose challenges beyond classical supervised learning, since a ground-truth label is not provided with each training sample. The scenario can also be viewed as one-step reinforcement learning (RL), where no transition function is available and environment interactions are independent and identically distributed (i.i.d.). There are well-established methods for contextual bandit problems with small, discrete action spaces, often known as multi-armed bandits with side information.
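The interaction protocol above can be sketched as a simple loop. This is an illustrative sketch only: the linear loss and the placeholder random policy are hypothetical stand-ins, not the agent proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def unknown_loss(context, action):
    # Hypothetical loss, hidden from the agent: the optimal action
    # is a fixed linear function of the context.
    target = context @ np.array([0.3, -0.2, 0.5])
    return float(abs(action - target))

def policy(context):
    # Placeholder: a learned agent would map the context to an action;
    # here we simply guess uniformly at random.
    return float(rng.uniform(-1.0, 1.0))

T = 100  # number of rounds
total_loss = 0.0
for t in range(T):
    context = rng.normal(size=3)           # observe a fresh i.i.d. context
    action = policy(context)               # choose one continuous action
    total_loss += unknown_loss(context, action)  # bandit feedback: only the
                                                 # chosen action's loss is seen

average_loss = total_loss / T
```

The key bandit constraint is visible in the loop: the agent never sees the loss of any action other than the one it chose, so it must explore to learn the context-to-action mapping.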
The optimal trade-off between exploration and exploitation is well studied in this setting, and formal regret bounds have been established (Gittins et al., 1989; Auer et al., 2002; Li et al., 2010; Garivier & Moulines, 2011; Agarwal et al., 2014). However, there is relatively little research into continuous action spaces. Recent works have focused on extreme classification, using tree-based methods to sample actions from a discretised action space with smoothing (Krishnamurthy et al., 2020; Majzoubi et al., 2020). However, developments in RL for continuous control raise the question: can we use one-step policy gradients to solve contextual bandits with continuous actions?

Real-world contextual bandit problems, such as those in healthcare, require solution methods that generalise to large context spaces, i.e. learn directly from images. We posit that existing tree-based methods are not well suited to this task in their current form (Krishnamurthy et al., 2020; Majzoubi et al., 2020; Bejjani & Courtot, 2022). However, recent breakthroughs in deep RL (Lillicrap et al., 2016; Arulkumaran et al., 2017; Schrittwieser et al., 2020; Hafner et al., 2021) have demonstrated continuous control directly from pixels. In this regard, neural networks have proven to be powerful and flexible function approximators, often employed as parametric policy and value networks. Adopting nonlinear function approximators typically forfeits theoretical guarantees of convergence, but works well in practice. Underpinning this recent work in deep RL for continuous control from pixels is the deterministic policy gradient, which can be estimated much more efficiently than the usual stochastic policy gradient (Silver et al., 2014; Lillicrap et al., 2016).

The contributions of our work are as follows:

1. We modify an existing RL algorithm for continuous control and demonstrate state-of-the-art results on contextual bandit problems with continuous actions. We evaluate on four OpenML datasets across two settings: online average regret and offline held-out costs.

2. We propose a deep contextual bandit agent that can handle image-based context representations and continuous actions. To the best of our knowledge, we are the first to tackle the challenging setting of multi-dimensional continuous actions and high-dimensional context spaces. We demonstrate state-of-the-art performance on image contexts.

3. We propose a new, challenging contextual bandits domain with multi-dimensional continuous actions and image contexts. We provide this domain as a new testbed for the community and present initial results with our RL agent. The domain is based on a 2D game of Tanks, in which two opposing tanks are situated at opposite sides of the game screen. The agent must learn control parameters to accurately fire trajectories at the enemy tank. There are three continuous action dimensions: agent x-location, turret angle and shot power. Example context images are provided in Figure 2.
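A minimal illustration of the one-step deterministic policy gradient may help: the gradient of the objective is grad_theta J = E[grad_theta mu_theta(x) * grad_a Q(x, a) evaluated at a = mu_theta(x)]. The sketch below optimises a linear policy mu_theta(x) = theta . x on a toy quadratic-cost bandit. For clarity, the critic's action-gradient is supplied by an oracle; in DDPG-style methods it would come from a learned value network. All names and the loss here are illustrative assumptions, not this paper's actual agent or benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_star = np.array([0.5, -0.3, 0.2])  # hypothetical optimum, unknown to the agent
theta = np.zeros(d)                      # deterministic actor: mu(x) = theta @ x
lr, sigma = 0.05, 0.1                    # actor step size, exploration noise scale

def grad_q(x, a):
    # Oracle action-gradient of the reward r(x, a) = -(a - theta_star @ x)^2.
    # In practice a learned critic network would provide this quantity.
    return -2.0 * (a - theta_star @ x)

for t in range(2000):
    x = rng.normal(size=d)                  # observe context
    a = theta @ x + sigma * rng.normal()    # behaviour action with exploration noise;
                                            # its observed reward would train the critic
    # One-step deterministic policy gradient update, with the action-gradient
    # evaluated at the deterministic action mu(x) = theta @ x:
    theta += lr * x * grad_q(x, theta @ x)
```

Because the problem is one-step, no bootstrapped value target is needed: the critic (here the oracle) estimates the immediate reward of an action, and the actor ascends its action-gradient.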

2. RELATED WORK

A naive approach to dealing with continuous actions is to simply discretise the action space (Slivkins et al., 2019). A major limitation of this approach is the curse of dimensionality: the number of possible actions increases exponentially with the number of action dimensions. This is exacerbated for tasks that require fine control, since they demand a correspondingly finer-grained discretisation, leading to an explosion in the number of discrete actions. A simple fixed discretisation of actions over the continuous space has been shown to be wasteful, motivating adaptive discretisation methods such as the Zooming Algorithm (Kleinberg et al., 2008; 2019).

Building upon the discretisation approach, extreme classification algorithms were recently developed (Krishnamurthy et al., 2020; Majzoubi et al., 2020). These works use tree-based policies and introduce the idea of smoothing over actions in order to create a probability density function over the entire action space. The authors provide a computationally tractable algorithm for large-scale experiments with continuous actions. However, their optimal performance guarantees scale inversely with the bandwidth parameter (the uniform smoothing region around the discretised action), and they provide no theoretical or empirical analysis of the effect of large context sizes. For these reasons, we propose an RL agent based on one-step policy gradients (Sutton et al., 1999) that handles continuous actions and scales well with context size.

Prior works have framed personalised healthcare problems as contextual bandit problems (Kallus & Zhou, 2018; Rindtorff et al., 2019). A one-step actor-critic method was proposed for binary actions relating to just-in-time adaptive interventions (Lei et al., 2017). However, until now, these methods have been restricted to discrete actions and small context vectors.
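The curse of dimensionality for naive discretisation is easy to quantify: a uniform grid with k bins per action dimension yields k^d joint discrete actions in d dimensions. The arithmetic below is purely illustrative; the bin count is an arbitrary assumption.

```python
def grid_size(bins_per_dim: int, n_dims: int) -> int:
    """Number of joint actions in a uniform grid discretisation."""
    return bins_per_dim ** n_dims

# With 100 bins per dimension, a three-dimensional action space
# (e.g. x-location, turret angle, shot power) already yields a
# million discrete actions:
sizes = [grid_size(100, n) for n in (1, 2, 3)]
# sizes == [100, 10000, 1000000]
```

Fine control compounds the problem: halving the bin width multiplies the joint action count by 2^d, which is why adaptive and smoothing-based schemes were developed.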
In contrast, the notion of deep contextual bandits has previously been introduced in (Zhu & Rigotti, 2021) to handle large context spaces. The authors propose a novel sample average uncertainty method; however, it is only suitable for discrete action spaces. To the best of our knowledge, no prior works have focused on the challenging intersection of continuous actions and large context spaces, an under-explored research area of particular interest in healthcare.

