REINFORCEMENT LEARNING FOR BANDITS WITH CONTINUOUS ACTIONS AND LARGE CONTEXT SPACES

Abstract

We consider the challenging scenario of contextual bandits with continuous actions and large context spaces, e.g. images. We posit that, by modifying reinforcement learning (RL) algorithms for continuous control, we can outperform handcrafted contextual bandit algorithms for continuous actions on standard benchmark datasets with vector contexts. We demonstrate that parametric policy networks outperform recently published tree-based policies (Majzoubi et al., 2020) in both average regret and cost on held-out samples. Furthermore, we show that RL algorithms generalise contextual bandit problems with continuous actions to large context spaces, and we provide state-of-the-art results on image contexts. Lastly, we introduce a new contextual bandit domain with a multidimensional continuous action space and image contexts, which existing methods cannot handle.

1. INTRODUCTION

We consider the challenging scenario of contextual bandits with continuous actions and large "context" spaces, e.g. images. This setting arises naturally when an agent is repeatedly asked to provide a single continuous action based on observing a context only once. In each round, the agent acquires a context, chooses an action from a continuous action space, and receives an immediate reward based on an unknown loss function. The process then repeats with a new context vector. The agent's goal is to learn to act optimally, usually by minimising regret across the actions selected over a fixed number of trials.

Our work is motivated by an increasingly important application in personalised healthcare. An agent is requested to make dosing decisions based on a patient's single 3D image scan, since additional scans after treatment can potentially harm the patient (Jarrett et al., 2019). This domain remains unsolved: current methods cannot handle the large context space associated with 3D scans or the high-dimensional actions required.

We consider this problem under the contextual bandits framework. Contextual bandits model single-step decision making under uncertainty, where both exploration and exploitation are required in unknown environments. They pose challenges beyond classical supervised learning, since a ground-truth label is not provided with each training sample. The setting can also be viewed as one-step reinforcement learning (RL), where no transition function is available and environment interactions are independent and identically distributed (i.i.d.). There are well-established methods for contextual bandit problems with small, discrete action spaces, often known as multi-armed bandits with side information.
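Concretely, the interaction protocol described above can be sketched as follows. The context distribution, the quadratic loss, and the placeholder policy are illustrative assumptions, not part of the paper's setup; the point is only the shape of the loop: observe a context once, commit to a continuous action, and receive feedback for that action alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_context():
    # i.i.d. context draw; a stand-in for a feature vector or flattened image
    return rng.normal(size=8)

def loss(context, action):
    # loss unknown to the agent; here a quadratic around a context-dependent optimum
    optimum = float(np.tanh(context).mean())
    return (action - optimum) ** 2

def agent_policy(context):
    # placeholder near-optimal policy with exploration noise;
    # a real agent would have to learn this mapping from bandit feedback
    return float(np.tanh(context).mean()) + rng.normal(scale=0.1)

T = 1000
total_loss = 0.0
for t in range(T):
    x = draw_context()        # observe the context once
    a = agent_policy(x)       # choose a continuous action
    total_loss += loss(x, a)  # receive only the loss for the chosen action

print(total_loss / T)
```

Note that the agent never sees the loss it would have incurred for other actions, which is what separates this setting from supervised regression.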
The optimal trade-off between exploration and exploitation is well studied in this setting, and formal regret bounds have been established (Gittins et al., 1989; Auer et al., 2002; Li et al., 2010; Garivier & Moulines, 2011; Agarwal et al., 2014). However, there is relatively little research on continuous action spaces. Recent works have focused on extreme classification, using tree-based methods to sample actions from a discretized action space with smoothing (Krishnamurthy et al., 2020; Majzoubi et al., 2020). Meanwhile, developments in RL for continuous control raise the question: can we use one-step policy gradients to solve contextual bandits with continuous actions?
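As a minimal sketch of what a one-step policy gradient could look like in this setting, the snippet below runs a REINFORCE-style score-function update for a linear-Gaussian policy on a toy continuous-action bandit. The policy class, toy loss, learning rate, and noise scale are all assumptions made for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

dim = 4
w_true = np.array([0.5, -0.3, 0.8, 0.1])  # optimum of the (agent-unknown) toy loss
w = np.zeros(dim)   # mean parameters of the Gaussian policy pi(a|x) = N(w.x, sigma^2)
sigma = 0.5         # fixed exploration noise
lr = 0.005

for t in range(40_000):
    x = rng.normal(size=dim)
    mu = w @ x
    a = mu + sigma * rng.normal()   # sample a continuous action from the policy
    r = -(a - w_true @ x) ** 2      # bandit feedback: reward for the chosen action only
    # one-step REINFORCE: grad_w log pi(a|x) = (a - mu) / sigma^2 * x,
    # so the score-function gradient estimate is r * (a - mu) / sigma^2 * x
    w += lr * r * (a - mu) / sigma**2 * x

print(np.linalg.norm(w - w_true))
```

Because interactions are i.i.d. and there is no transition function, this is exactly the one-step special case of a policy gradient; in practice a learned baseline (advantage estimate) would reduce the variance of the update.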

