DEEP JUMP Q-EVALUATION FOR OFFLINE POLICY EVALUATION IN CONTINUOUS ACTION SPACE

Anonymous

Abstract

We consider off-policy evaluation (OPE) in continuous action domains, such as dynamic pricing and personalized dose finding. In OPE, one aims to learn the value of a new policy using historical data generated by a different behavior policy. Most existing works on OPE focus on discrete action domains. To handle continuous action spaces, we develop a novel deep jump Q-evaluation method. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning, which allows us to apply existing OPE methods for discrete actions to the continuous setting. Our method is further justified by theoretical results and by experiments on synthetic and real datasets.

1. INTRODUCTION

Individualization leverages omni-channel data to meet individual needs, and individualized decision making plays a vital role in a wide variety of applications. Examples include customized pricing strategies in economics (Qiang & Bayati, 2016; Turvey, 2017), individualized treatment regimes in medicine (Chakraborty, 2013; Collins & Varmus, 2015), and personalized recommendation systems in marketing (McInerney et al., 2018; Fong et al., 2018). Prior to adopting any decision rule in practice, it is crucial to know the impact of implementing such a policy. In many applications, it is risky to run a policy online to estimate its value (see, e.g., Li et al., 2011). Off-policy evaluation (OPE), which learns the value of a policy offline from logged historical data, has therefore attracted considerable attention. Despite the popularity of OPE methods for finite action sets (see, e.g., Dudík et al., 2011; 2014; Swaminathan et al., 2017; Wang et al., 2017), less attention has been paid to continuous action domains, such as dynamic pricing (den Boer & Keskin, 2020) and personalized dose finding (Chen et al., 2016).

Recently, a few OPE methods have been proposed to handle continuous actions (Kallus & Zhou, 2018; Sondhi et al., 2020; Colangelo & Lee, 2020). All of these methods rely on a kernel function to extend the inverse probability weighting (IPW) or doubly robust (DR) approaches developed for discrete actions, and they suffer from three limitations. First, their validity requires the conditional mean of the reward given the feature-action pair to be a smooth function over the action space. This assumption can be violated in applications such as dynamic pricing, where the expected demand for a product has jump discontinuities as a function of the charged price (den Boer & Keskin, 2020). Second, the value estimator can be sensitive to the choice of the bandwidth parameter in the kernel function, and selecting this hyperparameter remains challenging. Kallus & Zhou (2018) proposed to tune it by minimizing the mean squared error of the resulting value estimator, but their procedure is extremely computationally intensive in moderate- or high-dimensional feature spaces; see Section 5 for details. Third, these kernel-based methods typically use a single bandwidth parameter, which is sub-optimal when the second-order derivative of the conditional mean function changes abruptly over the action space; see the toy example in Section 3.1 for details.

To address these limitations, we develop a deep jump Q-evaluation (DJQE) method that integrates multi-scale change point detection (see, e.g., Fryzlewicz, 2014), deep learning (LeCun et al., 2015), and OPE for discrete actions. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning, which allows us to apply IPW or DR methods to handle continuous actions. It is worth mentioning that our method does not require kernel bandwidth selection. Theoretically, we show that it allows the conditional mean to be either a continuous or a piecewise function of the action (Theorems 1 and 2) and that it converges faster than kernel-based OPE (Theorem 3). Empirically, we show that it outperforms state-of-the-art OPE methods on synthetic and real datasets.
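To convey the discretize-then-evaluate idea concretely before the formal development in Section 2, the sketch below maps continuous actions to intervals of a fixed grid and then applies a discrete-action IPW estimator. This is only a schematic illustration: DJQE learns the partition adaptively from data via deep jump Q-learning rather than using a fixed grid, the behavior policy is taken to be uniform so the interval propensity is simply the interval width, and all simulated quantities are illustrative assumptions.

```python
# Schematic sketch: discretize a continuous action space so that a discrete-action
# IPW estimator (formalized in Section 2) becomes applicable. The uniform grid
# below is purely illustrative; DJQE instead learns the partition adaptively.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(size=n)                      # one-dimensional feature
A = rng.uniform(size=n)                      # behavior policy: uniform on [0, 1]
Y = np.where(A < 0.5, 1.0, 2.0) * X + rng.normal(scale=0.1, size=n)  # piecewise Q-function

def target_policy(x):                        # hypothetical target policy pi(x)
    return np.clip(0.3 + 0.4 * x, 0.0, 1.0)

# Partition [0, 1] into intervals and treat "same interval" as "same action".
edges = np.linspace(0.0, 1.0, 11)            # fixed grid of 10 intervals
bin_A = np.digitize(A, edges[1:-1])          # interval index of the logged action
bin_pi = np.digitize(target_policy(X), edges[1:-1])   # interval index of pi(X)

# Probability mass of the logged action's interval under the uniform behavior policy.
interval_prob = np.diff(edges)[bin_A]

# Discrete-action IPW estimator applied to the discretized problem.
match = (bin_A == bin_pi).astype(float)
value_ipw = np.mean(match / interval_prob * Y)
print(f"discretized IPW value estimate: {value_ipw:.3f}")
```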

2. PRELIMINARIES

We first formulate the OPE problem. We then discuss kernel-based OPE methods and multi-scale change point detection, since our proposal is closely related to both.

2.1. OFF-POLICY EVALUATION

The observed dataset can be summarized as $\{(X_i, A_i, Y_i)\}_{1\le i\le n}$, where $O_i = (X_i, A_i, Y_i)$ denotes the feature-action-reward triplet for the $i$th subject and $n$ denotes the total sample size. We assume these data triplets are independent copies of some population variables $(X, A, Y)$. Let $\mathcal{X}$ and $\mathcal{A}$ denote the feature and action space, respectively. We focus on the setting where $\mathcal{A}$ is one-dimensional, as in dynamic pricing and personalized dose finding. A deterministic policy $\pi : \mathcal{X} \to \mathcal{A}$ determines the action to be assigned given the observed feature. We use $b$ to denote the behavior policy that generates the observed data. Specifically, $b(\cdot \mid x)$ denotes the probability density or mass function of $A$ given $X = x$, depending on whether $A$ is continuous or not. Define the expected reward conditional on the feature-action pair as $Q(x, a) = \mathbb{E}\{Y \mid X = x, A = a\}$. We refer to this function as the Q-function, to be consistent with the literature on individualized treatment regimes (Murphy, 2003). As is standard in the OPE and causal inference literature (see, e.g., Chen et al., 2016), we assume that the stable unit treatment value assumption (SUTVA), the no unmeasured confounders assumption, and the positivity assumption are satisfied. These assumptions guarantee that a policy's value is estimable from the observed data. Specifically, for a given target policy $\pi$, its value can be represented by $V(\pi) = \mathbb{E}\{Q(X, \pi(X))\}$. The goal of OPE is to learn the value $V(\pi)$ based on the observed data.
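To make the notation concrete, the following minimal sketch simulates logged triplets $(X_i, A_i, Y_i)$ from a hypothetical behavior policy and evaluates the target $V(\pi) = \mathbb{E}\{Q(X, \pi(X))\}$ by Monte Carlo in a setting where the true Q-function is known. All functional forms (`Q`, `target_policy`, the uniform behavior policy) are illustrative assumptions; in practice the Q-function is unknown and must be estimated from the logged data.

```python
# Illustration of the OPE setup: logged triplets (X_i, A_i, Y_i) generated by a
# behavior policy b, and the target value V(pi) = E[Q(X, pi(X))] for a new policy pi.
import numpy as np

rng = np.random.default_rng(1)

def Q(x, a):                      # hypothetical true Q-function (dose-response surface)
    return x * a - (a - 0.6) ** 2

def target_policy(x):             # deterministic target policy pi(x)
    return np.clip(0.4 + 0.3 * x, 0.0, 1.0)

# Logged data from the behavior policy (uniform on [0, 1], so b(a | x) = 1).
n = 10_000
X = rng.uniform(size=n)
A = rng.uniform(size=n)
Y = Q(X, A) + rng.normal(scale=0.1, size=n)          # observed reward with noise

# Target of OPE: V(pi) = E[Q(X, pi(X))], computed by Monte Carlo since Q is known here.
V_pi = np.mean(Q(X, target_policy(X)))
print(f"true value V(pi) (Monte Carlo): {V_pi:.3f}")
```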



2.2. KERNEL-BASED OPE

For discrete actions, Zhang et al. (2012) and Dudík et al. (2011) proposed a DR estimator of $V(\pi)$ given by
$$
\frac{1}{n}\sum_{i=1}^{n} \psi(O_i, \pi, \widehat{Q}, \widehat{b}) = \frac{1}{n}\sum_{i=1}^{n}\left[ \widehat{Q}(X_i, \pi(X_i)) + \frac{\mathbb{I}(A_i = \pi(X_i))}{\widehat{b}(A_i \mid X_i)}\left\{Y_i - \widehat{Q}(X_i, \pi(X_i))\right\} \right], \qquad (1)
$$
where $\mathbb{I}(\cdot)$ denotes the indicator function, and $\widehat{Q}$ and $\widehat{b}$ denote estimators of the Q-function and the behavior policy, respectively. The second term $\widehat{b}^{-1}(A_i \mid X_i)\,\mathbb{I}(A_i = \pi(X_i))\{Y_i - \widehat{Q}(X_i, \pi(X_i))\}$ inside the brackets corresponds to an augmentation term. Its expectation equals zero when $\widehat{Q} = Q$. The purpose of adding this term is to offer additional protection against potential misspecification of the Q-function model. Such an estimator is doubly robust in the sense that its consistency relies on either $\widehat{Q}$ or $\widehat{b}$ being correctly specified. By setting $\widehat{Q} = 0$, equation (1) reduces to the IPW estimator.

In continuous action domains, the indicator function $\mathbb{I}(A_i = \pi(X_i))$ equals zero almost surely. Consequently, naively applying equation (1) yields the plug-in estimator $n^{-1}\sum_{i=1}^{n} \widehat{Q}(X_i, \pi(X_i))$. To address this concern, kernel-based OPE replaces the indicator function in equation (1) with a kernel function $K\{(A_i - \pi(X_i))/h\}$ for some bandwidth parameter $h$, i.e.,
$$
\frac{1}{n}\sum_{i=1}^{n} \psi_h(O_i, \pi, \widehat{Q}, \widehat{b}) = \frac{1}{n}\sum_{i=1}^{n}\left[ \widehat{Q}(X_i, \pi(X_i)) + \frac{K\{(A_i - \pi(X_i))/h\}}{h\,\widehat{b}(A_i \mid X_i)}\left\{Y_i - \widehat{Q}(X_i, \pi(X_i))\right\} \right].
$$
The bandwidth $h$ represents a bias-variance trade-off: the variance of the resulting value estimator decays with $h$, whereas its bias increases with $h$. More specifically, it follows from Theorem 1 of Kallus & Zhou (2018) that the leading term of the bias equals
$$
\frac{h^2 \int u^2 K(u)\, du}{2}\, \mathbb{E}\left\{\frac{\partial^2 Q(X, a)}{\partial a^2}\bigg|_{a = \pi(X)}\right\}. \qquad (2)
$$
To ensure that the term in (2) decays to zero as $h$ goes to zero, the expected second-order derivative of the Q-function needs to exist, and thus $Q(x, a)$ needs to be a smooth function of $a$. However, as commented in the introduction, this assumption can be violated in applications such as dynamic pricing.
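To illustrate the kernel-based DR estimator and the bandwidth trade-off behind equation (2), the self-contained sketch below reuses the simulated setting from Section 2.1 with a Gaussian kernel. The behavior density is known and a deliberately misspecified Q-model ($\widehat{Q} = 0$, i.e., the kernel IPW special case) is used so that the estimate relies entirely on the kernel-weighted term; all functions and parameter values are illustrative assumptions rather than part of the proposed method.

```python
# Sketch of the kernel-based DR estimator: the indicator I(A_i = pi(X_i)) in
# equation (1) is replaced by K((A_i - pi(X_i))/h) / h, trading bias for
# variance through the bandwidth h.
import numpy as np

rng = np.random.default_rng(2)

def Q(x, a):                       # hypothetical true Q-function
    return x * a - (a - 0.6) ** 2

def target_policy(x):              # deterministic target policy pi(x)
    return np.clip(0.4 + 0.3 * x, 0.0, 1.0)

# Logged data: behavior policy uniform on [0, 1], so b(a | x) = 1.
n = 10_000
X = rng.uniform(size=n)
A = rng.uniform(size=n)
Y = Q(X, A) + rng.normal(scale=0.1, size=n)
b_hat = np.ones(n)                 # correctly specified behavior density at (A_i, X_i)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_dr_value(Q_hat, h):
    """Kernel-based DR estimator of V(pi) with bandwidth h (kernel analogue of (1))."""
    pi_X = target_policy(X)
    q_pi = Q_hat(X, pi_X)                                     # plug-in part Q_hat(X_i, pi(X_i))
    weights = gaussian_kernel((A - pi_X) / h) / (h * b_hat)   # smoothed version of I(A_i = pi(X_i)) / b
    return np.mean(q_pi + weights * (Y - q_pi))               # DR form with kernel smoothing

# Monte Carlo approximation of the true value, for reference.
truth = np.mean(Q(X, target_policy(X)))
for h in (0.02, 0.1, 0.5):
    # A misspecified Q-model (zero) reduces the estimator to its kernel IPW part.
    est = kernel_dr_value(lambda x, a: np.zeros_like(x), h)
    print(f"h = {h:4.2f}: estimate = {est:.3f}  (truth ~ {truth:.3f})")

# With a correctly specified Q-model the direct part already recovers the value.
print(f"DR with true Q, h = 0.1: {kernel_dr_value(Q, 0.1):.3f}")
```

Running the loop over several bandwidths makes the trade-off visible: small $h$ yields noisier estimates, while large $h$ introduces smoothing bias of the order given in (2).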

