DEEP JUMP Q-EVALUATION FOR OFFLINE POLICY EVALUATION IN CONTINUOUS ACTION SPACE

Anonymous

Abstract

We consider off-policy evaluation (OPE) in continuous action domains, such as dynamic pricing and personalized dose finding. In OPE, one aims to learn the value under a new policy using historical data generated by a different behavior policy. Most existing works on OPE focus on discrete action domains. To handle continuous action spaces, we develop a new deep jump Q-evaluation method for OPE. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning. This allows us to apply existing OPE methods for discrete domains to continuous actions. Our method is further justified by theoretical results and by experiments on synthetic and real datasets.

1. INTRODUCTION

Individualization proposes to leverage omni-channel data to meet individual needs. Individualized decision making plays a vital role in a wide variety of applications. Examples include customized pricing strategies in economics (Qiang & Bayati, 2016; Turvey, 2017), individualized treatment regimes in medicine (Chakraborty, 2013; Collins & Varmus, 2015), personalized recommendation systems in marketing (McInerney et al., 2018; Fong et al., 2018), etc. Prior to adopting any decision rule in practice, it is crucial to know the impact of implementing such a policy. In many applications, it is risky to run a policy online to estimate its value (see, e.g., Li et al., 2011). Off-policy evaluation (OPE), which learns the value of a policy offline from logged historical data, has thus attracted considerable attention. While many OPE methods have been developed for finite action sets (see, e.g., Dudík et al., 2011; 2014; Swaminathan et al., 2017; Wang et al., 2017), less attention has been paid to continuous action domains, such as dynamic pricing (den Boer & Keskin, 2020) and personalized dose finding (Chen et al., 2016). Recently, a few OPE methods have been proposed to handle continuous actions (Kallus & Zhou, 2018; Sondhi et al., 2020; Colangelo & Lee, 2020). All of these methods rely on a kernel function to extend the inverse probability weighting (IPW) or doubly robust (DR) approaches developed for discrete action domains. They suffer from three limitations. First, the validity of these methods requires the conditional mean of the reward given the feature-action pair to be a smooth function over the action space. This assumption can be violated in applications such as dynamic pricing, where the expected demand for a product has jump discontinuities as a function of the charged price (den Boer & Keskin, 2020). Second, the value estimator can be sensitive to the choice of the bandwidth parameter in the kernel function.
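To make the role of the bandwidth concrete, the kernel-based IPW estimator that these methods build on can be sketched as follows. This is a minimal, self-normalized version for a scalar action with a Gaussian kernel; the function names and the simplified setup are illustrative rather than taken from any of the cited papers.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density, used to smooth over the continuous action."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_ipw_value(rewards, actions, target_actions, behavior_density, h):
    """Self-normalized kernel IPW estimate of a target policy's value.

    rewards[i], actions[i] : observed reward and (scalar) action for sample i
    target_actions[i]      : action the target policy would take at feature x_i
    behavior_density[i]    : behavior policy's density at (x_i, actions[i])
    h                      : kernel bandwidth (the hyperparameter at issue)
    """
    # Weight each sample by how close its logged action is to the
    # target policy's action, corrected for the behavior density.
    w = gaussian_kernel((actions - target_actions) / h) / (h * behavior_density)
    return np.sum(w * rewards) / np.sum(w)

# Illustration: with a constant reward, the self-normalized estimator
# recovers that constant exactly, for any bandwidth h.
rng = np.random.default_rng(0)
logged_actions = rng.uniform(0.0, 1.0, size=200)
value = kernel_ipw_value(
    rewards=np.ones(200),
    actions=logged_actions,
    target_actions=np.full(200, 0.5),
    behavior_density=np.ones(200),  # uniform behavior policy on [0, 1]
    h=0.1,
)
```

For non-constant rewards, the estimate depends heavily on `h`: a small bandwidth uses few effective samples (high variance), while a large one averages rewards from distant actions (high bias), which is precisely the tuning difficulty discussed next.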
It remains challenging to select this hyperparameter. Kallus & Zhou (2018) proposed to tune it by minimizing the mean squared error of the resulting value estimator. However, their method is extremely computationally intensive in moderate- or high-dimensional feature spaces; see Section 5 for details. Third, these kernel-based methods typically use a single bandwidth parameter. This is sub-optimal in cases where the second-order derivative of the conditional mean function changes abruptly over the action space; see the toy example in Section 3.1 for details. To address these limitations, we develop a deep jump Q-evaluation (DJQE) method by integrating multi-scale change point detection (see, e.g., Fryzlewicz, 2014), deep learning (LeCun et al., 2015), and OPE in discrete action domains. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning. This allows us to apply IPW or DR methods to handle continuous actions. It is worth mentioning that our method does not require kernel bandwidth selection. Theoretically, we show that DJQE allows the conditional mean to be either a continuous or a piecewise function of the action (Theorems 1 and 2) and converges faster than kernel-based OPE (Theorem 3). Empirically, we show that it outperforms state-of-the-art OPE methods on synthetic and real datasets.
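The adaptive-discretization idea can be illustrated with a toy one-dimensional change-point step: segment a sequence of reward estimates on a grid of actions by penalized least squares, so that each detected segment can then be treated as a single discrete action. This dynamic-programming sketch is only a stand-in for the paper's deep-learning-based jump detection; `segment_actions` and the penalty choice are our own illustration.

```python
import numpy as np

def segment_actions(values, penalty):
    """Penalized least-squares segmentation of a 1-D signal.

    Splits `values` (e.g., reward estimates over an ordered action grid)
    into segments, fitting a constant on each; each extra segment costs
    `penalty`. Returns the right endpoints of the detected segments.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    prefix = np.concatenate([[0.0], np.cumsum(values)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(values ** 2)])

    def sse(i, j):
        # Squared error of a constant fit on values[i:j].
        s = prefix[j] - prefix[i]
        s2 = prefix_sq[j] - prefix_sq[i]
        return s2 - s * s / (j - i)

    # best[j]: minimal penalized cost of segmenting values[:j].
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Backtrack the segment boundaries.
    cuts, j = [], n
    while j > 0:
        cuts.append(int(j))
        j = prev[j]
    return sorted(cuts)

# A signal with one jump discontinuity: two segments are recovered,
# with the change point at index 3.
boundaries = segment_actions([0, 0, 0, 5, 5, 5], penalty=0.5)
```

Once the action space is partitioned this way, each segment plays the role of one discrete action, and standard discrete-action IPW or DR estimators apply directly, with no bandwidth to tune.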

