DEEP JUMP Q-EVALUATION FOR OFFLINE POLICY EVALUATION IN CONTINUOUS ACTION SPACE Anonymous

Abstract

We consider off-policy evaluation (OPE) in continuous action domains, such as dynamic pricing and personalized dose finding. In OPE, one aims to learn the value of a new policy using historical data generated by a different behavior policy. Most existing work on OPE focuses on discrete action domains. To handle continuous action spaces, we develop a novel deep jump Q-evaluation method. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning, which allows us to apply existing OPE methods for discrete actions. Our method is further supported by theoretical results and by experiments on synthetic and real datasets.

1. INTRODUCTION

Individualization proposes to leverage omni-channel data to meet individual needs. Individualized decision making plays a vital role in a wide variety of applications. Examples include customized pricing strategies in economics (Qiang & Bayati, 2016; Turvey, 2017), individualized treatment regimes in medicine (Chakraborty, 2013; Collins & Varmus, 2015), personalized recommendation systems in marketing (McInerney et al., 2018; Fong et al., 2018), etc. Prior to adopting any decision rule in practice, it is crucial to know the impact of implementing such a policy. In many applications, it is risky to run a policy online to estimate its value (see, e.g., Li et al., 2011). Off-policy evaluation (OPE) has thus attracted a lot of attention, as it learns the policy value offline from logged historical data. Despite the popularity of OPE methods for finite action sets (see e.g., Dudík et al., 2011; 2014; Swaminathan et al., 2017; Wang et al., 2017), less attention has been paid to continuous action domains, such as dynamic pricing (den Boer & Keskin, 2020) and personalized dose finding (Chen et al., 2016). Recently, a few OPE methods have been proposed to handle continuous actions (Kallus & Zhou, 2018; Sondhi et al., 2020; Colangelo & Lee, 2020). All these methods rely on a kernel function to extend the inverse probability weighting (IPW) or doubly robust (DR) approaches developed for discrete actions. They suffer from three limitations. First, the validity of these methods requires the conditional mean of the reward given the feature-action pair to be a smooth function over the action space. This assumption could be violated in applications such as dynamic pricing, where the expected demand for a product has jump discontinuities as a function of the charged price (den Boer & Keskin, 2020). Second, the value estimator can be sensitive to the choice of the bandwidth parameter in the kernel function, and it remains challenging to select this hyperparameter. Kallus & Zhou (2018) proposed to tune this parameter by minimizing the mean squared error of the resulting value estimator. However, their method is extremely computationally intensive in moderate- or high-dimensional feature spaces; see Section 5 for details. Third, these kernel-based methods typically use a single bandwidth parameter. This is sub-optimal in cases where the second-order derivative of the conditional mean function changes abruptly over the action space; see the toy example in Section 3.1 for details. To address these limitations, we develop a deep jump Q-evaluation (DJQE) method by integrating multi-scale change point detection (see e.g., Fryzlewicz, 2014), deep learning (LeCun et al., 2015), and OPE in discrete action domains. The key ingredient of our method lies in adaptively discretizing the action space using deep jump Q-learning. This allows us to apply IPW or DR methods to handle continuous actions. It is worth mentioning that our method does not require kernel bandwidth selection. Theoretically, we show that our method allows the conditional mean to be either a continuous or a piecewise function of the action (Theorems 1 and 2) and converges faster than kernel-based OPE (Theorem 3). Empirically, we show it outperforms state-of-the-art OPE methods on synthetic and real datasets.

2. PRELIMINARIES

We first formulate the OPE problem. We next discuss the kernel-based OPE methods and multi-scale change point detection, since our proposal is closely related to them.

2.1. OFF-POLICY EVALUATION

The observed dataset can be summarized as $\{(X_i, A_i, Y_i)\}_{1 \le i \le n}$, where $O_i = (X_i, A_i, Y_i)$ denotes the feature-action-reward triplet for the $i$th subject and $n$ denotes the total sample size. We assume these triplets are independent copies of some population variables $(X, A, Y)$. Let $\mathcal{X}$ and $\mathcal{A}$ denote the feature and action space, respectively. We focus on the setting where $\mathcal{A}$ is one-dimensional, as in dynamic pricing and personalized dose finding. A deterministic policy $\pi: \mathcal{X} \to \mathcal{A}$ determines the action to be assigned given the observed feature. We use $b$ to denote the behavior policy that generates the observed data; specifically, $b(\cdot|x)$ denotes the probability density or mass function of $A$ given $X = x$, depending on whether $A$ is continuous or not. Define the expected reward conditional on the feature-action pair as $Q(x, a) = \mathbb{E}\{Y \mid X = x, A = a\}$. We refer to this function as the Q-function, to be consistent with the literature on individualized treatment regimes (Murphy, 2003). As is standard in the OPE and causal inference literature (see e.g., Chen et al., 2016), we assume the stable unit treatment value assumption (SUTVA), the no unmeasured confounders assumption, and the positivity assumption are satisfied. These assumptions guarantee that a policy's value is estimable from the observed data. Specifically, for a given target policy $\pi$, its value can be represented as $V(\pi) = \mathbb{E}\{Q(X, \pi(X))\}$. The goal of OPE is to learn the value $V(\pi)$ from the observed data.

2.2. KERNEL-BASED OPE

For discrete actions, Zhang et al. (2012) and Dudík et al. (2011) proposed the DR estimator of $V(\pi)$,
$$\frac{1}{n}\sum_{i=1}^n \psi(O_i, \pi, \widehat{Q}, \widehat{b}) = \frac{1}{n}\sum_{i=1}^n \left[ \widehat{Q}(X_i, \pi(X_i)) + \frac{\mathbb{I}(A_i = \pi(X_i))}{\widehat{b}(A_i|X_i)} \{Y_i - \widehat{Q}(X_i, \pi(X_i))\} \right], \qquad (1)$$
where $\mathbb{I}$ denotes the indicator function, and $\widehat{Q}$ and $\widehat{b}$ denote estimators of the Q-function and the behavior policy. The second term $\widehat{b}^{-1}(A_i|X_i) \mathbb{I}(A_i = \pi(X_i)) \{Y_i - \widehat{Q}(X_i, \pi(X_i))\}$ inside the brackets corresponds to an augmentation term; its expectation equals zero when $\widehat{Q} = Q$. The purpose of adding this term is to offer additional protection against potential misspecification of the Q-function. Such an estimator is doubly robust in the sense that it is consistent when either $\widehat{Q}$ or $\widehat{b}$ is correctly specified. Setting $\widehat{Q} = 0$ reduces equation 1 to the IPW estimator. In continuous action domains, the indicator $\mathbb{I}(A_i = \pi(X_i))$ equals zero almost surely. Consequently, naively applying equation 1 yields the plug-in estimator $n^{-1}\sum_{i=1}^n \widehat{Q}(X_i, \pi(X_i))$. To address this concern, kernel-based OPE replaces the indicator in equation 1 with a kernel $K\{(A_i - \pi(X_i))/h\}$ for some bandwidth parameter $h$, i.e.,
$$\frac{1}{n}\sum_{i=1}^n \psi_h(O_i, \pi, \widehat{Q}, \widehat{b}) = \frac{1}{n}\sum_{i=1}^n \left[ \widehat{Q}(X_i, \pi(X_i)) + \frac{K\{(A_i - \pi(X_i))/h\}}{\widehat{b}(A_i|X_i)} \{Y_i - \widehat{Q}(X_i, \pi(X_i))\} \right].$$
The bandwidth $h$ represents a trade-off: the variance of the resulting value estimator decays with $h$, yet its bias increases with $h$. More specifically, it follows from Theorem 1 of Kallus & Zhou (2018) that the leading term of the bias equals
$$\frac{h^2}{2} \left\{\int u^2 K(u)\, du\right\} \mathbb{E}\left[\frac{\partial^2 Q(X, a)}{\partial a^2}\Big|_{a = \pi(X)}\right]. \qquad (2)$$
For the term in (2) to decay to zero as $h$ goes to 0, the expected second derivative of the Q-function must exist, and thus $Q(x, a)$ needs to be a smooth function of $a$. However, as commented in the introduction, this assumption could be violated in applications such as dynamic pricing.
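For illustration, the kernel-based estimator above can be sketched as follows. This is a minimal numpy implementation, not the authors' code: the function names are ours, a Gaussian kernel (with the usual $1/h$ density normalization folded in) is one possible choice of $K$, and `Q_hat` and `b_hat` are assumed to be pre-fitted estimators.

```python
import numpy as np

def kernel_dr_value(X, A, Y, pi, Q_hat, b_hat, h):
    """Kernel-based doubly robust value estimate with bandwidth h.

    X: (n,) or (n, p) features; A, Y: (n,) actions and rewards.
    pi(x): target policy; Q_hat(x, a): fitted Q-function;
    b_hat(a, x): fitted behavior-policy density at (a, x).
    """
    a_pi = np.array([pi(x) for x in X])  # actions the target policy would take
    q_pi = np.array([Q_hat(x, a) for x, a in zip(X, a_pi)])
    # Gaussian kernel replaces the indicator I(A_i = pi(X_i));
    # the 1/h factor makes it a density in A around pi(X_i).
    K = np.exp(-0.5 * ((A - a_pi) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    b = np.array([b_hat(a, x) for a, x in zip(A, X)])
    # plug-in term + kernel-weighted augmentation term
    return float(np.mean(q_pi + K / b * (Y - q_pi)))
```

With a correctly specified `Q_hat` the augmentation term is mean-zero, so the estimate concentrates around the true value; shrinking `h` reduces the kernel bias at the cost of a larger variance.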
Table 1: The bias and the standard deviation (in parentheses) of the estimated values of $V^{(1)}$ and $V^{(2)}$, using DJQE and kernel-based methods with a small bandwidth ($h = 0.8$) and a large bandwidth ($h = 2$). Here $n = 100$, $X, A \sim \mathrm{Unif}[0, 1]$, $Y | X, A \sim N\{Q(X, A), 1\}$, and the target policy is $\pi(x) = x$. For $V^{(1)}(\pi)$, DJQE attains 0.34 (0.13) versus 0.65 (0.17) for the small-bandwidth kernel.

2.3. MULTI-SCALE CHANGE POINT DETECTION

Change point analysis considers an ordered sequence of data, $Y_{1:n} = \{Y_1, \ldots, Y_n\}$, with unknown change point locations $\tau = \{\tau_1, \ldots, \tau_K\}$ for some unknown integer $K$. Here, each $\tau_i$ is an integer between 1 and $n - 1$ inclusive, and $\tau_i < \tau_j$ for $i < j$. These change points split the data into $K + 1$ segments; within each segment, the expected response is a constant function (see the left panel of Figure 1 for details). A number of methods have been proposed for estimating change points (see, for example, Boysen et al., 2009; Killick et al., 2012; Frick et al., 2014; Fryzlewicz, 2014, and the references therein) by minimizing a penalized objective function:
$$\arg\min_{\tau, K} \left[ \frac{1}{n} \sum_{i=1}^{K+1} \mathcal{C}\{Y_{(\tau_{i-1}+1):\tau_i}\} + \gamma K \right],$$
where $\mathcal{C}$ is a cost function that measures the goodness of fit of the constant function within each segment, and $\gamma K$ penalizes the number of change points with some regularization parameter $\gamma$. We remark that all the above cited works focus on models without features. Our proposal goes beyond these works in that we consider models with features and use deep neural networks (DNNs) to capture the complex relationship between the response and the features.
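The penalized objective above can be minimized exactly by a standard dynamic program. The following sketch (our own minimal implementation, feature-free as in the cited works, with the segment mean as the constant fit) illustrates the idea:

```python
import numpy as np

def segment_cost(y):
    """Cost of fitting a constant (the segment mean) to a segment."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def optimal_partition(y, gamma):
    """Exact dynamic program for the penalized change-point objective:
    minimize the sum of segment costs + gamma * (number of change points).
    Returns the sorted list of change-point indices."""
    n = len(y)
    F = [0.0] * (n + 1)   # F[v] = optimal penalized objective for y[:v]
    F[0] = -gamma         # offsets the penalty of the first segment
    prev = [0] * (n + 1)
    for v in range(1, n + 1):
        best, arg = None, 0
        for u in range(v):  # last segment is y[u:v]
            c = F[u] + segment_cost(y[u:v]) + gamma
            if best is None or c < best:
                best, arg = c, u
        F[v], prev[v] = best, arg
    # backtrack the optimal segmentation
    cps, v = [], n
    while v > 0:
        u = prev[v]
        if u > 0:
            cps.append(u)
        v = u
    return sorted(cps)
```

This exhaustive recursion costs $O(n^2)$ segment evaluations; the PELT pruning discussed in Section 3.3 reduces this to a linear cost.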

3. DEEP JUMP Q-EVALUATION

In Section 3.1, we use a toy example to demonstrate the limitation of kernel-based methods. We present the main idea of our algorithm in Section 3.2. Details are given in Section 3.3.

3.1. TOY EXAMPLE

As discussed in the introduction, existing kernel-based OPE methods use a single bandwidth to construct the value estimator. Ideally, the bandwidth $h$ in the kernel $K\{(A_i - \pi(X_i))/h\}$ should vary with $\pi(X_i)$ to improve the accuracy of the value estimator. To elaborate, consider the Q-function $Q(x, a) = 10 \max\{a^2 - 0.25, 0\} \log(x + 2)$ for any $x, a \in [0, 1]$. By definition, the Q-function is smooth over the entire feature-action space. However, it exhibits different "patterns" when the action belongs to different intervals. Specifically, for $a \in [0, 0.5]$, $Q(x, a)$ is constant as a function of $a$; for $a \in (0.5, 1]$, $Q(x, a)$ depends quadratically on $a$. See the middle panel of Figure 1 for details. Consider the target policy $\pi(x) = x$. We decompose the value $V(\pi)$ into $V^{(1)}(\pi) + V^{(2)}(\pi)$, where $V^{(1)}(\pi) = \mathbb{E}\, Q(X, \pi(X)) \mathbb{I}(\pi(X) \le 0.5)$ and $V^{(2)}(\pi) = \mathbb{E}\, Q(X, \pi(X)) \mathbb{I}(\pi(X) > 0.5)$. Similarly, denote the corresponding kernel-based value estimators by $\widehat{V}^{(1)}_h(\pi) = n^{-1} \sum_{i=1}^n \psi_h(O_i, \pi, \widehat{Q}, \widehat{b}) \mathbb{I}(\pi(X_i) \le 0.5)$ and $\widehat{V}^{(2)}_h(\pi) = n^{-1} \sum_{i=1}^n \psi_h(O_i, \pi, \widehat{Q}, \widehat{b}) \mathbb{I}(\pi(X_i) > 0.5)$. Since $Q(x, a)$ is a constant function of $a \in [0, 0.5]$, its second-order derivative $\partial^2 Q(x, a)/\partial a^2$ equals zero there. In view of (2), when $\pi(x) \le 0.5$, the bias of $\widehat{V}^{(1)}_h(\pi)$ will be small even with a sufficiently large $h$. As such, a large $h$ is preferred to reduce the variance of $\widehat{V}^{(1)}_h(\pi)$. When $\pi(x) > 0.5$, a small $h$ is preferred to reduce the bias of $\widehat{V}^{(2)}_h(\pi)$. See Table 1, where we report the bias and standard deviation of $\widehat{V}^{(1)}_h(\pi)$ and $\widehat{V}^{(2)}_h(\pi)$ with two different bandwidths. Due to the use of a single bandwidth, the kernel-based estimator suffers from either a large bias or a large variance. To overcome this limitation, we propose to adaptively discretize the action space into a union of disjoint intervals such that within each interval $I$, the Q-function $\{Q(x, a) : a \in I\}$ can be well-approximated by some function $Q_I(x)$ that is constant in $a \in I$.
Based on the discretization, one can apply IPW or DR to evaluate the value. The advantage of adaptive discretization is illustrated in the right panel of Figure 1 . When a ≤ 0.5, the Q-function is constant in a. It is likely that our procedure will not further split the interval [0, 0.5]. Consequently, the corresponding DR estimator for V (1) (π) will not suffer from large variance. When a > 0.5, our procedure will split (0.5, 1] into a series of sub-intervals, approximating Q by a step function. This guarantees the resulting DR estimator for V (2) (π) will not suffer from large bias. Consequently, the proposed value estimator achieves a smaller mean squared error than kernel-based estimators. See Table 1 for details.
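The DR estimator over a given discretization can be sketched as follows. This is our own illustration, not the paper's implementation: cross-fitting is omitted, and `q_hat` and `b_hat` stand for pre-fitted per-interval regression and generalized propensity score functions (hypothetical callables).

```python
import numpy as np

def dr_value_discretized(X, A, Y, pi, partition, q_hat, b_hat):
    """DR value estimate over a fixed discretization of the action space.

    partition: list of (lo, hi) intervals covering [0, 1];
    q_hat((lo, hi), x): fitted action-independent Q on the interval;
    b_hat((lo, hi), x): fitted generalized propensity Pr(A in I | X = x).
    """
    total = 0.0
    for i in range(len(A)):
        a_pi = pi(X[i])
        for lo, hi in partition:
            if lo <= a_pi < hi:          # interval containing pi(X_i)
                q = q_hat((lo, hi), X[i])
                total += q               # plug-in term
                if lo <= A[i] < hi:      # augmentation term fires only when
                    total += (Y[i] - q) / b_hat((lo, hi), X[i])  # A_i lands in I
                break
    return total / len(A)
```

Replacing the pointwise indicator $\mathbb{I}(A_i = \pi(X_i))$ with the interval indicator keeps a non-degenerate fraction of samples in the augmentation term, which is why no kernel bandwidth is needed.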

3.2. THE MAIN IDEA

For simplicity, we set the action space $\mathcal{A} = [0, 1]$. From now on, we focus on a subset of intervals in $[0, 1]$: by interval, we always refer to those of the form $[c, d)$ for some $0 \le c < d < 1$, or $[c, 1]$ for some $0 \le c < 1$, denoted by $I$. A discretization $\mathcal{D}$ of $\mathcal{A}$ is defined as a collection of mutually disjoint intervals that covers $\mathcal{A}$. Let $|\mathcal{D}|$ denote the number of intervals in $\mathcal{D}$ and $|I|$ denote the length of the interval $I$. We aim to identify an "optimal" discretization $\widehat{\mathcal{D}}$ such that for each interval $I \in \widehat{\mathcal{D}}$, $Q(x, a)$ is approximately a constant function of $a \in I$. The number of intervals in $\widehat{\mathcal{D}}$ represents a trade-off. If $|\widehat{\mathcal{D}}|$ is too large, then $\widehat{\mathcal{D}}$ will contain many short intervals, and the resulting IPW or DR estimator might suffer from large variance. Yet, a smaller value of $|\widehat{\mathcal{D}}|$ might result in a large bias. Our proposed method adaptively determines $\widehat{\mathcal{D}}$ and its size $|\widehat{\mathcal{D}}|$, as illustrated below. To begin with, we cut the entire action space $\mathcal{A}$ into $m$ initial intervals: $[0, 1/m), [1/m, 2/m), \ldots, [(m-1)/m, 1]$. The number $m$ shall be sufficiently large that the Q-function can be well-approximated by a piecewise function on these intervals. In practice, we recommend setting the initial number of intervals $m$ proportional to the sample size $n$. Note that the set of these initial intervals is not the final partition $\widehat{\mathcal{D}}$ that we recommend; it only serves as the set of initial candidate intervals. We next adaptively combine some of these initial intervals to form the final partition $\widehat{\mathcal{D}}$. As shown in our numerical studies (see Table 5 in Appendix B for more details), the size of the final partition $|\widehat{\mathcal{D}}|$ is usually much smaller than $m$. More specifically, denote by $\mathcal{B}(m)$ the set of discretizations $\mathcal{D}$ such that the end-points of each interval $I \in \mathcal{D}$ lie on the grid $\{j/m : j = 0, 1, \ldots, m\}$. We associate with each partition $\mathcal{D} \in \mathcal{B}(m)$ a collection of functions $\{Q_I\}_{I \in \mathcal{D}}$. These functions depend only on the features, not the action.
They are used to produce a piecewise approximation of the Q-function such that $Q(\cdot, a) \approx \sum_{I} \mathbb{I}(a \in I) Q_I(\cdot)$. We model these $Q_I$ using deep neural networks to capture the complex dependence between the response and the features. When the Q-function is well-approximated, we expect the least squares loss
$$\sum_{I \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^n \mathbb{I}(A_i \in I) \{Y_i - Q_I(X_i)\}^2$$
to be small. Consequently, $\widehat{\mathcal{D}}$ can be estimated by solving
$$(\widehat{\mathcal{D}}, \{\widehat{q}_I : I \in \widehat{\mathcal{D}}\}) = \arg\min_{\mathcal{D} \in \mathcal{B}(m),\, \{Q_I \in \mathcal{Q} : I \in \mathcal{D}\}} \left[ \sum_{I \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^n \mathbb{I}(A_i \in I) \{Y_i - Q_I(X_i)\}^2 + \gamma |\mathcal{D}| \right], \qquad (3)$$
for some regularization parameter $\gamma$ and DNN class $\mathcal{Q}$. Here, the penalty term $\gamma |\mathcal{D}|$ in equation 3 controls the total number of intervals in $\mathcal{D}$, as in multi-scale change point detection. A large $\gamma$ results in few intervals in $\widehat{\mathcal{D}}$ and a potentially large bias of the value estimator, whereas a small $\gamma$ could produce a large number of intervals in $\widehat{\mathcal{D}}$, leading to a noisy value estimator. In practice, we use cross-validation to select the regularization parameter $\gamma$ that minimizes the mean squared error of the fitted Q-function. We refer to this step as deep jump Q-learning; details are given in the next section. Given $\widehat{\mathcal{D}}$, one can apply IPW or DR (see equation 1) to derive the value estimates. To further reduce the bias of the value estimator, we employ a data splitting and cross-fitting strategy, which is commonly used in statistics (Romano & DiCiccio, 2019). That is, we use different subsets of the data to learn the discretization $\widehat{\mathcal{D}}$ and to construct the value estimator. A pseudocode summarizing our algorithm is given in Algorithm 1 in Appendix A. We present the details below.
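The objective in equation 3 can be evaluated for any candidate partition by fitting one action-independent regression per interval. The sketch below is illustrative only: a plain linear least-squares fit stands in for the paper's MLP class $\mathcal{Q}$, and the function names are ours.

```python
import numpy as np

def interval_cost(X, A, Y, lo, hi):
    """Least-squares cost of fitting one action-independent model Q_I(x)
    on the samples whose action falls in [lo, hi). A linear model stands
    in for the MLP class; samples with A == 1 are ignored for brevity."""
    mask = (A >= lo) & (A < hi)
    if mask.sum() < 2:
        return 0.0
    Xd = np.column_stack([np.ones(mask.sum()), X[mask]])  # add intercept
    beta, *_ = np.linalg.lstsq(Xd, Y[mask], rcond=None)
    resid = Y[mask] - Xd @ beta
    return float(resid @ resid) / len(A)  # (1/n)-normalized as in eq. 3

def partition_objective(X, A, Y, partition, gamma):
    """Penalized objective of equation 3 for a candidate partition,
    given as a list of (lo, hi) interval end-points."""
    return sum(interval_cost(X, A, Y, lo, hi) for lo, hi in partition) \
        + gamma * len(partition)
```

A partition whose intervals match the true jump structure fits each piece nearly perfectly, so its penalized objective falls below that of a coarser partition whenever $\gamma$ is not too large.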

3.3. THE COMPLETE ALGORITHM

We present the details of DJQE in this section. It consists of three steps: data splitting, deep jump Q-learning, and cross-fitting.

Step 1: Data Splitting. We divide all $n$ samples into $L$ subsets of equal size, where $\mathcal{L}_\ell$ denotes the indices of samples in the $\ell$th subset for $\ell = 1, \ldots, L$. Let $\mathcal{L}_\ell^c = \{1, 2, \ldots, n\} - \mathcal{L}_\ell$ be the complement of $\mathcal{L}_\ell$.

Step 2: Deep Jump Q-Learning. For each $\ell = 1, \ldots, L$, we apply deep jump Q-learning to compute a discretization $\widehat{\mathcal{D}}^{(\ell)}$ and $\{\widehat{q}_I^{(\ell)} : I \in \widehat{\mathcal{D}}^{(\ell)}\}$ by solving a version of equation 3 using the data subset indexed by $\mathcal{L}_\ell^c$ only. We next present the computational details for solving this optimization. Our approach is motivated by the PELT method (Killick et al., 2012) in multi-scale change point detection. Specifically, for any interval $I$, define $\widehat{q}_I^{(\ell)}$ as the minimizer of
$$\arg\min_{Q_I \in \mathcal{Q}} \frac{1}{|\mathcal{L}_\ell^c|} \sum_{i \in \mathcal{L}_\ell^c} \mathbb{I}(A_i \in I) \{Q_I(X_i) - Y_i\}^2, \qquad (4)$$
where $|\mathcal{L}_\ell^c|$ denotes the number of samples in $\mathcal{L}_\ell^c$. Define the cost function $\mathcal{C}^{(\ell)}(I)$ as the minimum value of the objective function in (4), i.e.,
$$\mathcal{C}^{(\ell)}(I) = \frac{1}{|\mathcal{L}_\ell^c|} \sum_{i \in \mathcal{L}_\ell^c} \mathbb{I}(A_i \in I) \{\widehat{q}_I^{(\ell)}(X_i) - Y_i\}^2.$$
In our implementation, we set $\mathcal{Q}$ to be the class of multilayer perceptrons (MLPs; see Figure 3 in Appendix A for an illustration) with $L$ hidden layers and $H$ nodes per hidden layer. The above optimization can be solved via the MLP regressor implementation of Pedregosa et al. (2011). Computation of $\widehat{\mathcal{D}}^{(\ell)}$ relies on dynamic programming (Friedrich et al., 2008). For any integer $1 \le v^* < m$, denote by $\mathcal{B}(m, v^*)$ the set consisting of all possible discretizations $\mathcal{D}_{v^*}$ of $[0, v^*/m)$. Setting $\mathcal{B}(m, m) = \mathcal{B}(m)$, we define the Bellman function as
$$\mathrm{Bell}(v^*) = \inf_{\mathcal{D}_{v^*} \in \mathcal{B}(m, v^*)} \left[ \sum_{I \in \mathcal{D}_{v^*}} \mathcal{C}^{(\ell)}(I) + \gamma (|\mathcal{D}_{v^*}| - 1) \right],$$
and $\mathrm{Bell}(0) = -\gamma$.
Our algorithm recursively updates the Bellman function via
$$\mathrm{Bell}(v^*) = \min_{v \in R_{v^*}} \left[ \mathrm{Bell}(v) + \mathcal{C}^{(\ell)}([v/m, v^*/m)) + \gamma \right], \quad \forall v^* \ge 1, \qquad (5)$$
where $R_{v^*}$ is the candidate change point list, updated during each iteration by
$$R_{v^*} = \left\{v \in R_{v^*-1} \cup \{v^*-1\} : \mathrm{Bell}(v) + \mathcal{C}^{(\ell)}([v/m, (v^*-1)/m)) \le \mathrm{Bell}(v^*-1)\right\}, \qquad (6)$$
with $R_0 = \{0\}$. The constraint in (6) is important, as it facilitates the computation by discarding change points that cannot be part of the final discretization, leading to a linear computational cost (Killick et al., 2012). To solve equation 5, we search for the optimal change point location $v$ that minimizes the right-hand side. This requires applying the MLP regressor to learn $\widehat{q}^{(\ell)}_{[v/m, v^*/m)}$ and $\mathcal{C}^{(\ell)}([v/m, v^*/m))$ for each $v \in R_{v^*}$. Let $v_1$ be the corresponding minimizer. We then define the change point list $\tau(v^*) = \{v_1, \tau(v_1)\}$. This procedure is iterated to compute $\mathrm{Bell}(v^*)$ and $\tau(v^*)$ for $v^* = 1, \ldots, m$. The optimal partition $\widehat{\mathcal{D}}^{(\ell)}$ is determined by the values stored in $\tau$ (see Algorithm 1 in Appendix A for details).

Step 3: Cross-Fitting. For each interval $I$ in the estimated optimal partition $\widehat{\mathcal{D}}^{(\ell)}$, we estimate the generalized propensity score function $\Pr(A \in I \mid X = x)$ via the MLP regressor using the training dataset $\mathcal{L}_\ell^c$. Let $\widehat{b}^{(\ell)}(I|x)$ denote the resulting estimate. The final estimated value of $V(\pi)$ is constructed via cross-fitting, given by
$$\widehat{V}(\pi) = \frac{1}{n} \sum_{\ell=1}^L \sum_{i \in \mathcal{L}_\ell} \sum_{I \in \widehat{\mathcal{D}}^{(\ell)}} \mathbb{I}\{\pi(X_i) \in I\} \left[ \widehat{q}_I^{(\ell)}(X_i) + \frac{\mathbb{I}(A_i \in I)}{\widehat{b}^{(\ell)}(I|X_i)} \{Y_i - \widehat{q}_I^{(\ell)}(X_i)\} \right]. \qquad (7)$$
Note that the samples used to construct $\widehat{V}$ are independent of those used to estimate $\widehat{q}_I^{(\ell)}$, $\widehat{b}^{(\ell)}$, and $\widehat{\mathcal{D}}^{(\ell)}$. This helps remove the bias induced by overfitting in the estimation of $\widehat{q}_I^{(\ell)}$, $\widehat{b}^{(\ell)}$, and $\widehat{\mathcal{D}}^{(\ell)}$.
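The PELT-style recursion and pruning rule above can be sketched as follows. This is a simplified illustration in our own notation: `cost(l, r)` abstracts the fitted per-interval cost $\mathcal{C}^{(\ell)}([l/m, r/m))$ (in the paper this is an MLP fit; any cost callable works here).

```python
def pelt_partition(cost, m, gamma):
    """PELT-style recursion over an m-point grid.

    cost(l, r): cost C([l/m, r/m)) of fitting one constant-in-action
    model on that interval. Returns (change points, optimal objective)."""
    bell = {0: -gamma}        # Bell(0) = -gamma
    tau = {0: []}             # change-point lists
    R = [0]                   # candidate change points
    for v_star in range(1, m + 1):
        vals = {v: bell[v] + cost(v, v_star) + gamma for v in R}
        v1 = min(vals, key=vals.get)          # best last change point
        bell[v_star] = vals[v1]
        tau[v_star] = tau[v1] + [v1] if v1 > 0 else []
        # pruning: drop candidates that can never be optimal again
        R = [v for v in R + [v_star]
             if bell[v] + cost(v, v_star) <= bell[v_star]]
    return tau[m], bell[m]
```

The pruning step mirrors (6): once a candidate's partial objective exceeds the current Bellman value, extending it can only do worse, so it is discarded, which yields the linear runtime of Killick et al. (2012).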

4. THEORY

We investigate the theoretical properties of the proposed estimator. For simplicity, let the feature support be $\mathcal{X} = [0, 1]^p$. We will show that our estimator is consistent when the Q-function is either a piecewise or a continuous function of $a$. Specifically, consider the following two model assumptions.

Model 1 (Piecewise constant function). Suppose
$$Q(x, a) = \sum_{I \in \mathcal{D}_0} q_I(x) \mathbb{I}(a \in I), \quad \forall x \in \mathcal{X},\ \forall a \in \mathcal{A}, \qquad (8)$$
for some partition $\mathcal{D}_0$ of $[0, 1]$ and a collection of continuous functions $(q_I)_{I \in \mathcal{D}_0}$.

Model 2 (Continuous function). Suppose $Q$ is a continuous function of $a$ and $x$.

We first consider the case where the Q-function takes the form of equation 8. We remark that kernel-based estimators fail under this model assumption, as they require the second-order derivative of the Q-function to exist. Without loss of generality, assume $q_{I_1} \ne q_{I_2}$ for any two adjacent intervals $I_1, I_2 \in \mathcal{D}_0$; this guarantees that the representation in equation 8 is unique. For any partition $\mathcal{D} = \{[0, \tau_1), [\tau_1, \tau_2), \ldots, [\tau_K, 1]\}$, we use $J(\mathcal{D})$ to denote the set of change points $\{\tau_1, \ldots, \tau_K\}$. We impose the following conditions to establish our theory.

Assumption 1. The number of hidden layers $L$ and the number of nodes per hidden layer $H$ diverge with $n$, such that $HL = O(n^\rho)$ for some constant $0 < \rho < 1/2$.

Assumption 2. The functions $\{\widehat{q}_I^{(\ell)}(\cdot)\}_{I \in \widehat{\mathcal{D}}^{(\ell)}}$ are uniformly bounded.

Assumption 1 is mild, as both $H$ and $L$ are parameters we specify. The requirement $HL = O(n^\rho)$ ensures that the stochastic error resulting from parameter estimation in the MLP is negligible. Assumption 2 ensures that the optimizer does not diverge in the $\ell_\infty$ sense. Similar assumptions are commonly imposed in the literature to derive convergence rates of DNN estimators (see e.g., Farrell et al., 2018). These two assumptions guarantee the uniform consistency of the DNN estimators $\{\widehat{q}_I\}_{I \in \widehat{\mathcal{D}}^{(\ell)}}$; see Lemma 1 in Appendix D for details.

Theorem 1. Suppose Model 1 and Assumptions 1 and 2 hold.
Suppose $m$ diverges to infinity with $n$. Then there exists some constant $\gamma_0$ such that, as long as $0 < \gamma \le \gamma_0$, the following events occur with probability approaching 1 (w.p.a.1): (i) $|\widehat{\mathcal{D}}^{(\ell)}| = |\mathcal{D}_0|$; (ii) $\max_{\tau \in J(\mathcal{D}_0)} \min_{\widehat{\tau} \in J(\widehat{\mathcal{D}}^{(\ell)})} |\widehat{\tau} - \tau| = o_p(1)$; (iii) $\widehat{V}(\pi) = V(\pi) + o_p(1)$ for any policy $\pi$ such that, for any $\tau_0 \in J(\mathcal{D}_0)$, $\Pr(\pi(X) \in [\tau_0 - \epsilon, \tau_0 + \epsilon]) \to 0$ as $\epsilon \to 0$.

Theorem 1 establishes the properties of our method in settings where $Q(x, a)$ is a piecewise function of $a$. Result (i) implies that deep jump Q-learning correctly identifies the number of change points. Result (ii) implies that every change point in $\mathcal{D}_0$ can be consistently identified. In particular, $J(\widehat{\mathcal{D}}^{(\ell)})$ corresponds to a subset of $\{1/m, 2/m, \ldots, (m-1)/m\}$. With a sufficiently large $m$, for any true change point $\tau$ in $\mathcal{D}_0$, there will be a change point in $\widehat{\mathcal{D}}^{(\ell)}$ that approaches $\tau$; consequently, the change point locations can be consistently estimated. To ensure the consistency of the proposed value estimator, we require that the distribution of the random variable $\pi(X)$ induced by the target policy does not have point masses at the change point locations. This condition is also mild; for instance, it automatically holds when $\pi(X)$ has a density function on $[0, 1]$.

Theorem 2. Suppose Model 2 and Assumptions 1 and 2 hold. Suppose $m$ diverges to infinity and $\gamma$ decays to zero. Then (i) $\max_{I \in \widehat{\mathcal{D}}^{(\ell)}} \sup_{a \in I} \mathbb{E} |\widehat{q}_I^{(\ell)}(X) - Q(X, a)|^2 = o(1)$; (ii) $\widehat{V}(\pi) - V(\pi) = o_p(1)$ for any $\pi$.

Theorem 2 establishes the properties of our method in settings where $Q$ is continuous in $a$. Result (i) implies that $\widehat{q}_I^{(\ell)}(\cdot)$ uniformly approximates $Q(\cdot, a)$ for any $a \in I$; the consistency of the value estimator in (ii) thus follows. Finally, we conduct an in-depth theoretical analysis to demonstrate the advantage of our estimator. Due to space constraints, we present an informal statement below; details are given in Appendix C.
Theorem 3 (Informal Statement). (i) When the Q-function belongs to a class of piecewise constant functions of the action $a$, the minimax rate of convergence of the proposed value estimator is $O_p(n^{-1/2})$; in contrast, the minimax convergence rate of kernel-based estimators is $O_p(n^{-1/3})$. (ii) When the Q-function belongs to a class of smooth functions of $a$, the minimax convergence rate of our estimator is faster than that of kernel-based estimators whenever the bandwidth undersmoothes or oversmoothes the data.

5. EXPERIMENTS

In this section, we investigate the finite sample performance of the proposed DJQE on the synthetic and real datasets, in comparison to two kernel-based methods. The computing infrastructure used is a virtual machine in the AWS Platform with 72 processor cores and 144GB memory.

5.1. SYNTHETIC DATA

Synthetic data are generated from the following model: $Y | X, A \sim N\{Q(X, A), 1\}$, $A | X \sim \mathrm{Unif}[0, 1]$, and $X^{(1)}, X^{(2)}, \ldots, X^{(p)} \overset{iid}{\sim} \mathrm{Unif}[-1, 1]$, where $X = [X^{(1)}, X^{(2)}, \ldots, X^{(p)}]$. We consider the following scenarios:

S1: $Q(x, a) = (1 + x^{(1)}) \mathbb{I}(a < 0.35) + (x^{(1)} - x^{(2)}) \mathbb{I}(0.35 \le a < 0.65) + (1 - x^{(2)}) \mathbb{I}(a \ge 0.65)$;
S2: $Q(x, a) = \mathbb{I}(a < 0.25) + \sin(2\pi x^{(1)}) \mathbb{I}(0.25 \le a < 0.5) + \{0.5 - 8(x^{(1)} - 0.75)^2\} \mathbb{I}(0.5 \le a < 0.75) + 0.5\, \mathbb{I}(a \ge 0.75)$;
S3 (toy): $Q(x, a) = 10 \max\{a^2 - 0.25, 0\} \log(x^{(1)} + 2)$;
S4: $Q(x, a) = 0.2(8 + 4x^{(1)} - 2x^{(2)} - 2x^{(3)}) - 2(1 + 0.5x^{(1)} + 0.5x^{(2)} - 2a)^2$.

The Q-function is a piecewise function of $a$ under Scenarios 1 and 2, and is continuous under Scenarios 3 (the toy example considered in Section 3.1) and 4. We set the target policy to be the optimal policy that achieves the highest possible mean reward. We list the oracle mean value under the optimal policy for each scenario in the first column of Table 4 in Appendix B. We apply the proposed DJQE and two kernel-based methods (Kallus & Zhou, 2018; Colangelo & Lee, 2020) to Scenarios 1-4 with 20-dimensional covariates ($p = 20$) and $n \in \{50, 100, 200, 300\}$. For DJQE, we select $\gamma \in \{0.1, 0.2, 0.3, 0.4, 0.5\} \cdot n^{0.4}$ based on five-fold cross-validation. Here, we set $m = n/10$ to achieve a good balance between the bias and the computational cost (see Figure 4 in Appendix B for the detailed computational cost of DJQE and the resulting bias as a function of $m$ in Scenario 1 with $n = 100$). We find it extremely computationally intensive to compute the optimal bandwidth $h^*$ using Kallus & Zhou (2018)'s method (see the detailed comparison of computational cost under different methods based on Scenario 1 in Table 3). Thus, as suggested in Kallus & Zhou (2018), we first compute $h^*$ using data with sample size $n_0 = 50$. To accommodate data with different sample sizes $n$, we adjust $h^*$ by setting $h^* \{n_0/n\}^{0.2}$.
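The data-generating mechanism above is straightforward to reproduce; the following sketch (our own helper, `generate_s1` is not from the paper) simulates Scenario S1:

```python
import numpy as np

def generate_s1(n, p=20, seed=None):
    """Simulate Scenario S1: piecewise-constant-in-action Q-function,
    uniform behavior policy, standard Gaussian reward noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, p))   # features iid Unif[-1, 1]
    A = rng.uniform(0, 1, size=n)         # behavior policy b(a|x) = Unif[0, 1]
    Q = np.where(A < 0.35, 1 + X[:, 0],
                 np.where(A < 0.65, X[:, 0] - X[:, 1], 1 - X[:, 1]))
    Y = Q + rng.standard_normal(n)        # Y | X, A ~ N(Q(X, A), 1)
    return X, A, Y, Q
```

The other scenarios follow the same template with their respective Q-functions substituted in.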
To implement Colangelo & Lee (2020)'s estimator, we consider a list of bandwidths given by $h = c \sigma_A n^{-0.2}$ with $c \in \{0.5, 0.75, 1.0, 1.5\}$, where $\sigma_A$ is the sample standard deviation of the action. We then manually select the best bandwidth such that the resulting value estimator achieves the smallest mean squared error. The conditional mean and generalized propensity score are fitted via the MLP regressor with 10 hidden layers and 10 neurons per layer. The average estimated value and its standard deviation over 100 replicates are illustrated in Figure 2 for the different methods, with detailed values reported in Table 4 in Appendix B. In addition, we provide the size of the final estimated partition under DJQE in Table 5 in Appendix B; it is much smaller than $m$ in most cases. It can be seen from Figure 2 that the proposed DJQE is very efficient and outperforms all competing methods in almost all cases. We note that the proposed method performs reasonably well even when the sample size is small ($n = 50$). In contrast, the two kernel-based methods fail to accurately estimate the value even in some cases with $n = 300$. Between the two kernel-based OPE approaches, we observe that the method developed by Kallus & Zhou (2018) performs better in general.

5.2. REAL DATA: PERSONALIZED DOSE FINDING

Warfarin is commonly used for preventing thrombosis and thromboembolism. We use the dataset provided by the International Warfarin Pharmacogenetics Consortium (2009) for our analysis. We choose the $p = 81$ baseline covariates considered in Kallus & Zhou (2018). This yields a total of 3964 patients with complete records of baseline information. The response is defined through the absolute distance between the international normalized ratio (INR, a measurement of the time it takes for blood to clot) after treatment and the ideal value 2.5, i.e., $Y = -|\mathrm{INR} - 2.5|$. We use min-max normalization to convert the range of the dose level $A$ into $[0, 1]$. To compare the different methods, we calibrate the dataset to generate simulated outcomes. Specifically, we first estimate the Q-function via an MLP regressor with 10 hidden layers and 50 neurons per layer using the whole dataset. The goodness of fit of the fitted model under the MLP regressor is reported in Table 6 in Appendix B. We next use the fitted Q-function $\widehat{Q}(X, A)$ to simulate the data. Given a feature-action pair $(x_j, a_j)$ randomly sampled from $\{(X_1, A_1), \ldots, (X_n, A_n)\}$, we draw the reward $r_j$ from $N\{\widehat{Q}(x_j, a_j), \sigma^2\}$, where $\sigma$ is the standard deviation of the fitted residuals $\{Y_i - \widehat{Q}(X_i, A_i)\}_i$. Given the simulated data $\{(x_j, a_j, r_j) : 1 \le j \le n\}$, we are interested in evaluating the optimal policy $\pi^*(X) = \arg\max_{a \in [0, 1]} \widehat{Q}(X, a)$. The oracle value under the optimal policy is $V^* = -0.278$. We apply DJQE to the calibrated Warfarin data, against the two kernel-based methods. Due to the extremely intensive computation required by Kallus & Zhou (2018)'s method, we directly apply the estimated optimal bandwidth $h^*$ from their real data analysis, since they used the same dataset. Biases, standard deviations, and mean squared errors of the estimated values over 20 replicates with sample size $n = 500$ are reported in Table 2 for the different methods.
It can be observed from Table 2 that the proposed DJQE achieves a much smaller mean squared error than the two kernel-based methods when evaluating the optimal policy. Specifically, DJQE yields a bias of 0.259 with a standard deviation of 0.416, in contrast to a large bias of 0.662 with a standard deviation of 0.742 under Kallus & Zhou (2018)'s method, and a bias of 0.442 with a large standard deviation of 1.164 under Colangelo & Lee (2020)'s method. The proposed DJQE for off-policy evaluation with continuous actions therefore works better than the kernel-based methods.

6. DISCUSSION

Currently, we focus on settings with a single decision point. It would be of practical interest to extend our proposal to sequential decision making. A potential drawback of our method is that it becomes computationally intensive for large $m$, as the runtime increases linearly in $m$.

A MORE ON THE IMPLEMENTATION

We summarize our algorithm in Algorithm 1.

Algorithm 1: Deep Jump Q-Evaluation

Global: data $\{(X_i, A_i, Y_i)\}_{1 \le i \le n}$; number of initial intervals $m$; penalty parameter $\gamma$; target policy $\pi$.
Local: an upper triangular matrix of costs $\mathcal{C} \in \mathbb{R}^{m(m+1)/2}$; Bellman function $\mathrm{Bell} \in \mathbb{R}^m$; partition $\widehat{\mathcal{D}}$; DNN functions $\{\widehat{q}_I, \widehat{b}_I : I \in \widehat{\mathcal{D}}\}$; a vector $\tau \in \mathbb{N}^m$; a set of candidate change point lists $R$.
Output: the value estimate $\widehat{V}(\pi)$ for the target policy.

I. Split all $n$ samples into $L$ subsets $\{\mathcal{L}_1, \ldots, \mathcal{L}_L\}$; $\widehat{V}(\pi) \leftarrow 0$;
II. Initialize an even segmentation of the action space into $m$ pieces: $\{I\} = \{[0, 1/m), [1/m, 2/m), \ldots, [(m-1)/m, 1]\}$;
III. For $\ell = 1, \ldots, L$:
  1. Set the training dataset as $\mathcal{L}_\ell^c = \{1, 2, \ldots, n\} - \mathcal{L}_\ell$;
  2. $\mathrm{Bell}(0) \leftarrow -\gamma$; $\widehat{\mathcal{D}} \leftarrow \{[0, 1]\}$; $\tau \leftarrow \mathrm{Null}$; $R(0) \leftarrow \{0\}$;
  3. Collect cost functions. For $r = 1, \ldots, m$ and $l = 0, \ldots, r - 1$:
     (i) let $I = [l/m, r/m)$ if $r < m$, else $I = [l/m, 1]$;
     (ii) fit an MLP regressor $\widehat{q}_I(\cdot)$ on $\{(X_i, Y_i) : i \in \mathcal{L}_\ell^c, A_i \in I\}$;
     (iii) calculate the cost: $\mathcal{C}(I) \leftarrow \sum_{i \in \mathcal{L}_\ell^c} \mathbb{I}(A_i \in I) \{\widehat{q}_I(X_i) - Y_i\}^2$;
  4. Apply the PELT method to obtain the partition. For $v^* = 1, \ldots, m$:
     (i) $\mathrm{Bell}(v^*) \leftarrow \min_{v \in R(v^*)} \{\mathrm{Bell}(v) + \mathcal{C}([v/m, v^*/m)) + \gamma\}$;
     (ii) $v_1 \leftarrow \arg\min_{v \in R(v^*)} \{\mathrm{Bell}(v) + \mathcal{C}([v/m, v^*/m)) + \gamma\}$;
     (iii) $\tau(v^*) \leftarrow \{v_1, \tau(v_1)\}$;
     (iv) $R(v^*) \leftarrow \{v \in R(v^* - 1) \cup \{v^* - 1\} : \mathrm{Bell}(v) + \mathcal{C}([v/m, (v^* - 1)/m)) \le \mathrm{Bell}(v^* - 1)\}$;
  5. Backtrack the change points stored in $\tau$ to obtain $\widehat{\mathcal{D}}$: $r \leftarrow m$; while $r > 0$: $l \leftarrow \tau(r)$; add the interval with end-points $l/m$ and $r/m$ to $\widehat{\mathcal{D}}$; $r \leftarrow l$;
  6. Evaluation using the testing dataset $\mathcal{L}_\ell$:
     $\widehat{V}(\pi) \mathrel{+}= \sum_{I \in \widehat{\mathcal{D}}} \sum_{i \in \mathcal{L}_\ell} \mathbb{I}\{\pi(X_i) \in I\} \big[ \widehat{q}_I(X_i) + \mathbb{I}(A_i \in I) \widehat{b}_I^{-1}(I|X_i) \{Y_i - \widehat{q}_I(X_i)\} \big]$;
return $\widehat{V}(\pi)/n$.

Figure 3: Illustration of a multilayer perceptron with two hidden layers and three nodes per hidden layer. Here $\mu$ is the input, and $A^{(l)}$ and $b^{(l)}$ denote the parameters that produce the linear transformation of the $(l-1)$th layer.

B ADDITIONAL EXPERIMENTAL RESULTS

We include additional experimental results in this section. 

C ADDITIONAL THEORETICAL RESULTS

In this section, we conduct an in-depth theoretical analysis to compare the minimax convergence rate of the proposed value estimator with that of kernel-based value estimators. We first briefly summarize our theoretical findings. When the Q-function belongs to a class of piecewise constant functions of the action a, the proposed estimator converges at a faster rate than kernel-based estimators. Specifically, kernel-based OPE converges at a rate of n^{-1/3}, whereas our estimator converges at a rate of n^{-1/2}. When the Q-function belongs to a class of Lipschitz continuous functions of a, our estimator converges at a rate of n^{-1/5}, whereas kernel-based estimators converge at a slower rate when the bandwidth undersmooths or oversmooths the data. As we have commented, it remains extremely challenging to tune the bandwidth of kernel-based OPE in practice, since OPE is an unsupervised problem. The use of a suboptimal bandwidth leads to a slower rate of convergence for kernel-based estimators. In contrast, we develop a supervised learning algorithm to adaptively discretize the action space; the tuning parameters in our procedure can be selected via cross-validation. All the above findings are supported by our observations in the toy example (Section 3.1) and the numerical experiments (Section 5) as well. We next present our theoretical results, beginning with some notation. Define the following classes of Q-functions:

Q_1 = { Q : Q(a, x) = Σ_{I∈D_0} I(a ∈ I) q_I(x); max_{I_1 ≠ I_2, I_1, I_2 ∈ D_0} E|q_{I_1}(X) − q_{I_2}(X)|² ≥ ε; |D_0| ≤ C_1; Q(a, ·) ∈ Λ(β, C_2) for all a },

Q_2 = { Q : sup_{a_1, a_2, x} |Q(a_1, x) − Q(a_2, x)| ≤ C_3 |a_1 − a_2|; Q(a, ·) ∈ Λ(β, C_2) for all a },

for some sufficiently small constant ε > 0 and some sufficiently large constants C_1, C_2 > 0, where Λ(β, C_2) denotes the class of β-smooth functions (see e.g., Section 3.1.1, Shi et al., 2020). By definition, the first set Q_1 consists of all piecewise constant functions of a with at most C_1 − 1 change points.
The second set Q_2 contains the class of Lipschitz continuous functions of a. Finally, define the policy class Π = {π : π(X) has a density function bounded by C_4} for some sufficiently large constant C_4 > 0. To simplify the analysis, we assume the behavior policy is known, so that b̂_I^(ℓ) = b; in addition, we assume the conditional variance of Y given A and X is uniformly bounded away from zero.

Theorem 3 Suppose 2β > p. Then for any π ∈ Π and any Q ∈ Q_1, with a proper choice of γ and the MLP class, the minimax rate of convergence of the proposed value estimator is O_p(n^{-1/2}). In contrast, the minimax convergence rate of the kernel-based estimator is O_p(n^{-1/3}). Suppose 4β > 3p. Then for any Q ∈ Q_2, with a proper choice of γ and the MLP class, the minimax rate of convergence of the proposed value estimator is O_p(n^{-1/5}), up to some logarithmic factors. In contrast, the minimax convergence rate of the kernel-based estimator is slower when h ≍ n^{-κ} for some κ < 1/5 or κ > 3/5.

We briefly comment on the condition 2β > p. This condition allows the MLP regressor to converge at a rate faster than n^{-1/4} (see e.g., Imaizumi & Fukumizu, 2019). Such a condition is commonly assumed in the literature on evaluating average treatment effects (see e.g., Chernozhukov et al., 2017; Farrell et al., 2018). In the second part of Theorem 3, we further require the MLP regressor to converge at a rate faster than n^{-3/10}.

D TECHNICAL PROOF

Throughout the proof, we use c, C, c_0, c̄, etc., to denote universal constants whose values are allowed to change from place to place. Let O_i = {X_i, A_i, Y_i} denote the data from the ith observation. For any interval I, let b(I|x) denote the integral ∫_{a∈I} b(a|x) da. The proofs of Theorems 1 and 2 rely on Lemmas 1 and 2. Specifically, Lemma 1 establishes the uniform convergence rate of q̂_I for any I whose length is no shorter than cγ and that belongs to the set of intervals

I(m) = {[i_1/m, i_2/m) : for some integers i_1, i_2 with 0 ≤ i_1 < i_2 < m} ∪ {[i_3/m, 1] : for some integer i_3 with 0 ≤ i_3 < m}.

To state this lemma, we first introduce some notation. For any such interval I, define the function q_I(x) = E(Y | A ∈ I, X = x). It is immediate to see that the definition of q_I here is consistent with the one defined in equation 8 for any I ∈ D_0.

Lemma 1 The following holds when the conditions in either Theorem 1 or Theorem 2 hold: max_{I∈I(m), |I|≥cγ} E[|q_I(X) − q̂_I^(ℓ)(X)|² | {O_i}_{i∈L_ℓ^c}] = o_p(1), for any positive constant c > 0.

Lemma 2 When the conditions in either Theorem 1 or Theorem 2 hold, min_{I∈D̂^(ℓ)} |I| ≥ Cγ w.p.a.1 for some constant C > 0.

We first present the proofs of these two lemmas, and then the proofs of Theorems 1 and 2.
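For concreteness, I(m) can be enumerated directly. Its cardinality is m(m−1)/2 + m = m(m+1)/2, which matches the size of the upper-triangular cost matrix C in Algorithm 1 and is in particular bounded by the crude count (m+1)² used in the proof of Lemma 1. The helper name below is ours.

```python
def candidate_intervals(m):
    """Enumerate I(m): half-open grid intervals [i1/m, i2/m) with
    0 <= i1 < i2 < m, together with the right-closed intervals [i3/m, 1]."""
    half_open = [(i1 / m, i2 / m) for i2 in range(1, m) for i1 in range(i2)]
    right_closed = [(i3 / m, 1.0) for i3 in range(m)]
    return half_open + right_closed
```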

D.1 PROOF OF LEMMA 1

For any sufficiently small constant ε > 0, it suffices to show

max_{I∈I(m), |I|≥cγ} E[|q_I(X) − q̂_I^(ℓ)(X)|² | {O_i}_{i∈L_ℓ^c}] ≤ C ε², w.p.a.1,

for some constant C > 1 whose value will be specified later. The proof is divided into two parts. In the first part, we show that there exist some constants {θ*_I}_I such that

max_{I∈I(m)} max_{x∈X} |q_I(x) − MLP(x; θ*_I)| ≤ ε. (10)

In the second part, we show

max_{I∈I(m), |I|≥cγ} E[|MLP(X; θ*_I) − q̂_I^(ℓ)(X)|² | {O_i}_{i∈L_ℓ^c}] ≤ (C − 1) ε². (11)

The proof is hence completed.

Part 1. Under the step model assumption, for any I_0 ∈ D_0, q_{I_0}(·) is continuous. Since the support X is compact, q_{I_0}(·) is uniformly continuous. Similarly, we can show that b is uniformly continuous as well. For any I ∈ I(m), we have

q_I(x) = [∫_{a∈I} Σ_{I_0∈D_0} b(a|x) I(a ∈ I_0) q_{I_0}(x) da] / b(I|x).

It follows from the uniform continuity of q_{I_0} and b and from the positivity assumption that q_I(·) is continuous for any I ∈ I(m). Similarly, under the model assumption in Theorem 2, we can show that q_I(·) is continuous for any I ∈ I(m) as well. Consequently, when the conditions in either Theorem 1 or Theorem 2 hold, q_I(·) is continuous for any I ∈ I(m). By the Stone-Weierstrass theorem, there exists a multivariate polynomial function q*_I such that the absolute value of the residual q_I − q*_I is uniformly bounded by ε/2. By Theorem 1 of Yarotsky (2017), there exists a feedforward neural network with a bounded number of hidden units that uniformly approximates q*_I, with the approximation error uniformly bounded by ε/2 in absolute value. By Lemma 1 of Farrell et al. (2018), such a feedforward network can be embedded into an MLP with a bounded number of hidden units. Since we allow H and L to diverge, such an MLP can be further embedded into an MLP with L layers and the widths of all layers proportional to H. This yields equation 10. The proof of Part 1 is thus completed.

Part 2. We aim to show that equation 11 holds.
Under the boundedness assumption on Y, {q_I(·)}_{I∈I(m)} are uniformly bounded. We first observe that MLP(·; θ*_I) is a bounded function, by equation 10 and the fact that q_I(·) is bounded. Next, it follows from equation 10 that

E I(A ∈ I){Y − MLP(X; θ*_I)}² / Pr(A ∈ I) = E I(A ∈ I){Y − E(Y|A ∈ I, X)}² / Pr(A ∈ I) + E I(A ∈ I){E(Y|A ∈ I, X) − MLP(X; θ*_I)}² / Pr(A ∈ I) ≤ E I(A ∈ I){Y − E(Y|A ∈ I, X)}² / Pr(A ∈ I) + ε². (12)

By definition,

Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ*_I)}² / Pr(A ∈ I) ≥ Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ̂_I^(ℓ))}² / Pr(A ∈ I). (13)

Suppose we can show

sup_{I∈I(m), |I|≥cγ} sup_{θ_I∈Θ*} | Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ_I)}² / {|L_ℓ^c| Pr(A ∈ I)} − E I(A ∈ I){Y − MLP(X; θ_I)}² / Pr(A ∈ I) | = o_p(1), (14)

where the set Θ* consists of all θ_I such that sup_x |MLP(x; θ_I)| ≤ M for some sufficiently large constant M such that both θ*_I and θ̂_I^(ℓ) belong to Θ*, w.p.a.1. Similar to equation 12, we have

E I(A ∈ I){Y − MLP(X; θ̂_I^(ℓ))}² / Pr(A ∈ I) = E I(A ∈ I){Y − E(Y|A ∈ I, X)}² / Pr(A ∈ I) + E I(A ∈ I){E(Y|A ∈ I, X) − MLP(X; θ̂_I^(ℓ))}² / Pr(A ∈ I).

It follows that

E I(A ∈ I){E(Y|A ∈ I, X) − MLP(X; θ̂_I^(ℓ))}² / Pr(A ∈ I) = E b(I|X){E(Y|A ∈ I, X) − MLP(X; θ̂_I^(ℓ))}² / Pr(A ∈ I) ≤ 2ε².

Under the positivity assumption, sup_x b(I|x)/Pr(A ∈ I) ≥ 2(c̄ − 1)^{-1} for some constant c̄ > 1. This yields equation 11. To complete the proof, it remains to show that equation 14 holds. Using similar arguments to Section A.2.2 of Farrell et al. (2018), for any I, we have with probability at least 1 − 2exp(−γ̄_n) that

| Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ_I)}² / {|L_ℓ^c| Pr(A ∈ I)} − E I(A ∈ I){Y − MLP(X; θ_I)}² / Pr(A ∈ I) | ≤ c̄ { M γ̄_n/n + √(γ̄_n/n) σ(I, θ_I) + σ(I, θ_I) √(H²L²/n) (√log(M/σ(I, θ_I)) + √log n) },

for some constant c̄ > 0, where σ²(I, θ) denotes the variance of Pr(A ∈ I)^{-1} I(A ∈ I){Y − MLP(X; θ)}². Under the positivity assumption, we have σ²(I, θ) ≤ O(1)|I|^{-1}, where O(1) denotes some positive constant. Consequently, for any I whose length is no smaller than cγ, we have with probability at least 1 − 2exp(−γ̄_n) that

| Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ_I)}² / {|L_ℓ^c| Pr(A ∈ I)} − E I(A ∈ I){Y − MLP(X; θ_I)}² / Pr(A ∈ I) | ≤ c_0 { M γ̄_n/n + √(γ̄_n/(nγ)) + √(H²L²/(nγ)) (√log γ^{-1} + √log n) },

for some constant c_0 > 0. Note that the number of elements in I(m) is upper bounded by (m + 1)². Since m is proportional to n, by setting γ̄_n = c* log n for some constant c* > 0, we have 1 − 2(m + 1)² exp(−γ̄_n) → 1. Consequently, it follows from Bonferroni's inequality that the following event occurs w.p.a.1 for any I ∈ I(m) such that |I| ≥ cγ:

| Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − MLP(X_i; θ_I)}² / {|L_ℓ^c| Pr(A ∈ I)} − E I(A ∈ I){Y − MLP(X; θ_I)}² / Pr(A ∈ I) | ≤ O(1) { M log n/n + √(log n/(nγ)) + √(H²L²/(nγ)) log n }, (15)

where O(1) denotes some positive constant. Under the given conditions, the RHS is o(1). The proof is hence completed.

D.2 PROOF OF LEMMA 2

Consider a given interval I ∈ D̂^(ℓ). We can find some interval I' ∈ I(m) ∩ D̂^(ℓ) that is adjacent to I. Consequently, the interval I ∪ I' belongs to I(m) as well. It follows that

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² + (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I'){Y_i − q̂_{I'}^(ℓ)(X_i)}² ≤ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I ∪ I'){Y_i − q̂_{I∪I'}^(ℓ)(X_i)}² − γ. (16)

By definition, q̂_{I∪I'}^(ℓ) minimizes the loss Σ_{i∈L_ℓ^c} I(A_i ∈ I ∪ I'){Y_i − q(X_i)}². It follows that

Σ_{i∈L_ℓ^c} I(A_i ∈ I ∪ I'){Y_i − q̂_{I∪I'}^(ℓ)(X_i)}² ≤ Σ_{i∈L_ℓ^c} I(A_i ∈ I ∪ I'){Y_i − q̂_{I'}^(ℓ)(X_i)}².

Combining this together with equation 16 yields

γ ≤ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_{I'}^(ℓ)(X_i)}². (17)

As both Y and q̂_{I'}^(ℓ) are uniformly bounded, we have γ ≤ c|L_ℓ^c|^{-1} Σ_{i∈L_ℓ^c} I(A_i ∈ I) for some constant c > 0. By Bernstein's inequality, for any t > 0 we have

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I) ≤ Pr(A ∈ I) + t,

with probability at least 1 − exp(−nt²/{c(|I| + t)}) for some constant c > 0. Similar to equation 15, by setting t_I = c_0 max(√(|I| n^{-1} log n), n^{-1} log n) for some constant c_0 > 0, we obtain w.p.a.1 that

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I) ≤ Pr(A ∈ I) + t_I, (18)

for any I ∈ I(m). Under the condition that b is continuous and A × X is compact, b is bounded. Consequently, the probability Pr(A ∈ I) is proportional to the length of I. In view of equation 18, any I ∈ D̂^(ℓ) shall satisfy

γ/c_1 ≤ |I| + log n/n + 2√(|I| log n/n) ≤ 2|I| + 2 log n/n,

for some constant c_1 > 0, w.p.a.1, where the last inequality follows from the Cauchy-Schwarz inequality. As γ ≫ n^{-1} log n, we obtain |I| ≥ γ/c_2 for any I ∈ D̂^(ℓ) and some constant c_2 > 0, w.p.a.1. The proof is hence completed.

D.3 PROOF OF THEOREM 1

We begin with an outline of the proof, which is divided into four steps. In the first step, we show

Pr(|D̂^(ℓ)| ≤ |D_0|) → 1. (19)

In the second step, we show

max_{τ∈J(D_0)} min_{τ̂∈J(D̂^(ℓ))} |τ̂ − τ| < δ_min, (20)

where δ_min = min_{I∈D_0} |I|/3. By the definition of δ_min, this implies that

Pr(|D̂^(ℓ)| ≥ |D_0|) → 1. (22)

Combining equation 22 together with equation 19 proves (i) in Theorem 1. In the third step, we show max_{τ∈J(D_0)} min_{τ̂∈J(D̂^(ℓ))} |τ̂ − τ| = o_p(1); this proves (ii) in Theorem 1. In the last step, we show that (iii) holds. The proof is thus completed. We next detail the proof of each step.

Step 1. Assume |D_0| > 1; otherwise, equation 19 automatically holds. Consider the partition D = {[0, 1]}, which consists of a single interval, and the zero Q-function q_{[0,1]}(x) = 0 for any x. By definition, we have

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² + |L_ℓ^c|γ|D̂^(ℓ)| ≤ Σ_{i∈L_ℓ^c} Y_i² + |L_ℓ^c|γ.

Since Y is uniformly bounded and γ = o(1), the right-hand side (RHS) is O(n). Consequently, we obtain

|D̂^(ℓ)| ≤ c_0 γ^{-1}, (23)

for some constant c_0 > 0. Notice that

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² ≥ η_1 + η_2 − 2η_3,

where η_1 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q_I(X_i)}², η_2 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){q̂_I^(ℓ)(X_i) − q_I(X_i)}², and η_3 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q_I(X_i)}{q̂_I^(ℓ)(X_i) − q_I(X_i)}.

We next show η_2, η_3 = o_p(n). Consider η_2 first. By Lemma 2, we have w.p.a.1 that

η_2 = Σ_{I∈D̂^(ℓ), |I|≥Cγ} Σ_{i∈L_ℓ^c} I(A_i ∈ I){q̂_I^(ℓ)(X_i) − q_I(X_i)}².

We decompose the RHS as

Σ_{I∈D̂^(ℓ), |I|≥Cγ} |L_ℓ^c| E[I(A ∈ I){q̂_I^(ℓ)(X) − q_I(X)}² | {O_i}_{i∈L_ℓ^c}] + Σ_{I∈D̂^(ℓ), |I|≥Cγ} Σ_{i∈L_ℓ^c} [ I(A_i ∈ I){q̂_I^(ℓ)(X_i) − q_I(X_i)}² − E{I(A ∈ I){q̂_I^(ℓ)(X) − q_I(X)}² | {O_i}_{i∈L_ℓ^c}} ].

The first line is o_p(n) by Lemma 1. Using similar arguments to equation 15, we can show that the second term is upper bounded by

c |L_ℓ^c| Σ_{I∈D̂^(ℓ), |I|≥Cγ} { M log n/n + √(|I| H² L² log n/n) },

for some constant c > 0, w.p.a.1.
In view of equation 23, the above expression can be further bounded by

c n { c_0 M log n/(γn) + Σ_{I∈D̂^(ℓ)} √(|I| H² L² log n/n) } ≤ O(1) n { log n/(γn) + √(H² L² log n/(γn)) },

where O(1) denotes some positive constant and the second inequality follows from the Cauchy-Schwarz inequality. This yields η_2 = o_p(n). Using similar arguments, we can show η_3 = o_p(n). It follows that

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² ≥ η_1 + o_p(n).

Notice that

η_1 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − Q(A_i, X_i) + Q(A_i, X_i) − q_I(X_i)}² = η_4 + η_5 + 2η_6,

where η_4 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − Q(A_i, X_i)}², η_5 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Q(A_i, X_i) − q_I(X_i)}², and η_6 = Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − Q(A_i, X_i)}{Q(A_i, X_i) − q_I(X_i)}.

Using similar arguments to the bounds on η_2 and η_3, we can show η_6 = o_p(n) and that

η_5 = |L_ℓ^c| Σ_{I∈D̂^(ℓ)} ∫_I E b(a|X)|Q(X, a) − q_I(X)|² da + o_p(n).

The first term on the RHS is greater than c_0 n Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da for some constant c_0 > 0. It follows that

η_1 ≥ η_4 + c_0 |L_ℓ^c| Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da + o_p(n). (25)

Note that under the step model assumption, η_4 can be rewritten as Σ_{I∈D_0} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q_I(X_i)}². This, together with equation 25, yields

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² ≥ Σ_{I∈D_0} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q_I(X_i)}² + c_0 |L_ℓ^c| Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da + o_p(n).

For any integer k with 1 ≤ k ≤ |D_0| − 1, let τ*_{0,k} be a grid point satisfying τ*_{0,k} = i/m for some integer i and |τ_{0,k} − τ*_{0,k}| < m^{-1}; set τ*_{0,0} = 0 and τ*_{0,|D_0|} = 1, and let q*_{[τ*_{0,k−1}, τ*_{0,k})} = q_{[τ_{0,k−1}, τ_{0,k})} for 1 ≤ k ≤ |D_0| − 1 and q*_{[τ*_{0,|D_0|−1}, 1]} = q_{[τ_{0,|D_0|−1}, 1]}. Denote by D* the oracle grid partition formed by the change point locations {τ*_{0,k}}_k. Let Δ_k = [τ*_{0,k−1}, τ*_{0,k}) ∩ [τ_{0,k−1}, τ_{0,k})^c for 1 ≤ k ≤ |D_0| − 1 and Δ_{|D_0|} = [τ*_{0,|D_0|−1}, 1] ∩ [τ_{0,|D_0|−1}, 1]^c. The length of each interval Δ_k is at most m^{-1}. It follows that

Σ_{I∈D*} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q*_I(X_i)}² − Σ_{I∈D_0} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q_I(X_i)}² ≤ Σ_{k=1}^{|D_0|} Σ_{i∈L_ℓ^c} I(A_i ∈ Δ_k) {Y_i² + sup_{I⊆[0,1]} sup_x q_I²(x)}. (26)
Using similar arguments to the bound on the RHS of equation 17 in the proof of Lemma 2, we can show that the RHS of equation 26 is o_p(n) as m diverges to infinity. Combining this together with equation 25 yields

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² ≥ Σ_{I∈D*} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q*_I(X_i)}² + c_0 |L_ℓ^c| Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da + o_p(n). (27)

By definition, we have

Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I^(ℓ)(X_i)}² + |L_ℓ^c|γ|D̂^(ℓ)| ≤ Σ_{I∈D*} Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q*_I(X_i)}² + |L_ℓ^c|γ|D*|.

Since |D*| = |D_0|, it follows from equation 27 that

Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da ≤ γ{|D_0| − |D̂^(ℓ)|} + o_p(1). (28)

Note that the left-hand side (LHS) is non-negative. As γ > 0, we obtain |D̂^(ℓ)| ≤ |D_0|, w.p.a.1. This completes the proof of the first step.

Step 2. It follows from equation 28 that Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da ≤ γ{|D_0| − 1} + o_p(1), and hence

Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da ≤ 2γ{|D_0| − 1}, w.p.a.1. (29)

We aim to show that equation 20 holds under the event defined in equation 29. Suppose otherwise; then there exists some τ_0 ∈ J(D_0) such that |τ̂ − τ_0| ≥ δ_min for all τ̂ ∈ J(D̂^(ℓ)). Under the event defined in equation 29, we obtain

∫_{τ_0−δ_min}^{τ_0+δ_min} E|Q(X, a) − q_I(X)|² da ≤ 2γ(|D_0| − 1), w.p.a.1. (30)

On the other hand, since Q(X, a) is a constant function of a on [τ_0 − δ_min, τ_0) and on [τ_0, τ_0 + δ_min), we have

∫_{τ_0−δ_min}^{τ_0+δ_min} E|Q(X, a) − q_I(X)|² da ≥ min_q { δ_min E|q_{[τ_0−δ_min, τ_0)}(X) − q(X)|² + δ_min E|q_{[τ_0, τ_0+δ_min)}(X) − q(X)|² } ≥ (δ_min/2) E|q_{[τ_0−δ_min, τ_0)}(X) − q_{[τ_0, τ_0+δ_min)}(X)|² ≥ δ_min κ_0/2,

where κ_0 ≡ min_{I_1, I_2 ∈ D_0, I_1 and I_2 adjacent} E|q_{I_1}(X) − q_{I_2}(X)|² > 0. This apparently violates equation 30 when γ ≤ δ_min κ_0/{4(|D_0| − 1)}. Equation 20 thus holds w.p.a.1.

Step 3. Combining the results obtained in the first two steps yields |D̂^(ℓ)| = |D_0|, w.p.a.1. It follows from equation 28 that Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da = o_p(1). Using similar arguments to Step 2, we can show max_{τ∈J(D_0)} min_{τ̂∈J(D̂^(ℓ))} |τ̂ − τ| < ε w.p.a.1, for any sufficiently small ε > 0. This completes the proof of the third step.

Step 4. We begin with some notation. For any π, we define a random policy π_{D̂^(ℓ)} according to the partition D̂^(ℓ) as follows:

π_{D̂^(ℓ)}(a|x) = Σ_{I∈D̂^(ℓ)} I{π(x) ∈ I, a ∈ I} b(a|x)/b(I|x). (31)

Note that ∫_0^1 π_{D̂^(ℓ)}(a|x) da = Σ_{I∈D̂^(ℓ)} I{π(x) ∈ I} = 1 for any x. Consequently, π_{D̂^(ℓ)} is a "valid" random policy. The proposed value estimator V̂^(ℓ) is doubly robust with respect to V(π_{D̂^(ℓ)}). By Lemma 1, the estimated Q-function is consistent. Consequently, V̂^(ℓ) is consistent for V(π_{D̂^(ℓ)}).
Since the proposed estimator is a weighted average of the V̂^(ℓ)'s, it suffices to show that V(π_{D̂^(ℓ)}) is consistent for V(π). Note that

V(π_{D̂^(ℓ)}) = E ∫_{[0,1]} Q(X, a) Σ_{I∈D̂^(ℓ)} I{π(X) ∈ I, a ∈ I} b(a|X)/b(I|X) da = Σ_{I_0∈D_0} E[ q_{I_0}(X) Σ_{I∈D̂^(ℓ)} I{π(X) ∈ I} b(I ∩ I_0|X)/b(I|X) ].

Similarly, we can show V(π) = Σ_{I_0∈D_0} E[ q_{I_0}(X) I{π(X) ∈ I_0} ]. For each I_0 ∈ D_0, there exists an interval I ∈ D̂^(ℓ) such that |I ∪ I_0|/|I| → 1; we denote this interval by I_0^(ℓ). Under the given conditions, we have

V(π_{D̂^(ℓ)}) = Σ_{I_0∈D_0} E[ q_{I_0}(X) I{π(X) ∈ I_0^(ℓ)} ] + o_p(1).

In view of equation 31, the value difference |V(π_{D̂^(ℓ)}) − V(π)| can be upper bounded by Σ_{I_0∈D_0} Pr(π(X) ∈ I_0 − I_0^(ℓ)) + o_p(1). These terms decay to zero under the given conditions on π(X). The proof is hence completed.
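The induced random policy π_{D̂^(ℓ)} defined in equation 31 can be sanity-checked numerically. The sketch below assumes a uniform behavior density b(a|x) ≡ 1 on [0, 1], so that b(I|x) = |I| and π_{D̂^(ℓ)}(·|x) is simply the uniform density on the interval of the partition containing π(x); the function name is ours.

```python
def pi_partition(a, x, pi, partition):
    """Density pi_D(a|x) = sum_{I in D} 1{pi(x) in I, a in I} b(a|x)/b(I|x),
    specialized to a uniform behavior density b(a|x) = 1 on [0, 1]."""
    def contains(t, lo, hi):
        # treat the last interval as closed at 1, as in the paper
        return lo <= t < hi or (hi == 1.0 and t == 1.0)
    for lo, hi in partition:
        if contains(pi(x), lo, hi):
            return 1.0 / (hi - lo) if contains(a, lo, hi) else 0.0
    return 0.0
```

Integrating this density over a fine grid confirms that π_{D̂^(ℓ)}(·|x) integrates to one, the "validity" property noted above.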

D.4 PROOF OF THEOREM 2

We need the following lemma to prove Theorem 2.

Lemma 3 Assume the conditions in Theorem 2 hold. Then for any interval I ∈ I(m) with |I| ≳ γ and any interval I' ∈ D̂^(ℓ) with I ⊆ I', we have w.p.a.1 that E|q_I(X) − q_{I'}(X)|² = o(1), where the little-o term is uniform in I and I'.

The rest of the proof is divided into three steps. In the first step, we show that Assertion (i) in Theorem 2 holds. In the second step, we show that Assertion (ii) in Theorem 2 holds. Finally, we present the proof of Lemma 3.

Step 1. Consider a sequence {d_n}_n such that d_n → 0 and d_n ≫ γ. We aim to show

inf_{a∈I'} E[|Q(X, a) − q̂_{I'}^(ℓ)(X)|² | {O_i}_{i∈L_ℓ^c}] = o_p(1), for any I' ∈ D̂^(ℓ).

By Lemma 1, it suffices to show inf_{a∈I'} E|Q(X, a) − q_{I'}(X)|² = o(1).

Step 2. The value difference can be bounded as

|V(π_{D̂^(ℓ)}) − V(π)| ≤ Σ_{I∈D̂^(ℓ)} inf_{a'∈I} ∫_I E|Q(a', X) − Q(π(X), X)| I(π(X) ∈ I) b(a|X)/b(I|X) da = Σ_{I∈D̂^(ℓ)} inf_{a'∈I} E|Q(a', X) − Q(π(X), X)| I(π(X) ∈ I) ≤ inf_{a', a''∈I, I∈D̂^(ℓ)} E|Q(a', X) − Q(a'', X)|.

In Step 1 of the proof, we have shown that inf_{a∈I, I∈D̂^(ℓ)} E|Q(X, a) − q_I(X)|² = o(1). It follows that inf_{a, a'∈I, I∈D̂^(ℓ)} E|Q(X, a) − Q(a', X)|² = o(1), and hence inf_{a, a'∈I, I∈D̂^(ℓ)} E|Q(a, X) − Q(a', X)| = o(1), by the Cauchy-Schwarz inequality. This completes the proof of the second step.

Step 3. For a given interval I' ∈ D̂^(ℓ), the set of intervals I considered in Lemma 3 can be classified into three categories. Category 1: I = I'. It is immediate to see that q̂_I = q̂_{I'}, and the assertion automatically holds. By definition, we have

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} Σ_{I_0∈D̂^(ℓ)*} I(A_i ∈ I_0){Y_i − q̂_{I_0}(X_i)}² + γ|D̂^(ℓ)*| ≥ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} Σ_{I_0∈D̂^(ℓ)} I(A_i ∈ I_0){Y_i − q̂_{I_0}(X_i)}² + γ|D̂^(ℓ)|,

and hence

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² + (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I*){Y_i − q̂_{I*}(X_i)}² ≥ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I'){Y_i − q̂_{I'}(X_i)}² − γ.

It follows from the definition of q̂_{I*} that

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I*){Y_i − q̂_{I*}(X_i)}² ≤ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I*){Y_i − q̂_{I'}(X_i)}².

Therefore, we obtain

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² ≥ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_{I'}(X_i)}² − γ. (32)

Using similar arguments to the proof of equation 32, we can show

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² ≥ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_{I'}(X_i)}² − 2γ.

Hence, regardless of whether I' belongs to Category 2 or to Category 3, we have

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² ≥ (1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_{I'}(X_i)}² − 2γ. (33)

Using similar arguments to equation 15, we can show w.p.a.1 that

(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_I(X_i)}² = E[I(A ∈ I){Y − q̂_I(X)}² | {O_i}_{i∈L_ℓ^c}] + o(γ|I|),
(1/|L_ℓ^c|) Σ_{i∈L_ℓ^c} I(A_i ∈ I){Y_i − q̂_{I'}(X_i)}² = E[I(A ∈ I){Y − q̂_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] + o(γ|I|),

where the little-o terms are uniform in I and I'. Combining these together with equation 33 yields

E[I(A ∈ I){Y − q̂_I(X)}² | {O_i}_{i∈L_ℓ^c}] ≥ E[I(A ∈ I){Y − q̂_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 2γ + o(γ|I|),

for any I and I', w.p.a.1. Note that q_I satisfies E[I(A ∈ I){Y − q_I(X)} | X] = 0. We have

E[I(A ∈ I){q_I(X) − q̂_I(X)}² | {O_i}_{i∈L_ℓ^c}] ≥ E[I(A ∈ I){q_I(X) − q̂_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 2γ + o(γ|I|).

Consider the first term on the RHS. Note that

E[I(A ∈ I){q_I(X) − q̂_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] = E[I(A ∈ I){q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] + E[I(A ∈ I){q̂_{I'}(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 2E[I(A ∈ I){q_I(X) − q_{I'}(X)}{q̂_{I'}(X) − q_{I'}(X)} | {O_i}_{i∈L_ℓ^c}].

By the Cauchy-Schwarz inequality, the last term on the RHS can be lower bounded by

−(1/2) E[I(A ∈ I){q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 2E[I(A ∈ I){q̂_{I'}(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}].

It follows that

E[I(A ∈ I){q_I(X) − q̂_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] ≥ (1/2) E[I(A ∈ I){q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 3E[I(A ∈ I){q̂_{I'}(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}],

and hence

(1/2) E[I(A ∈ I){q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] − 2γ + o(γ|I|) ≤ E[I(A ∈ I){q_I(X) − q̂_I(X)}² | {O_i}_{i∈L_ℓ^c}] + 3E[I(A ∈ I){q̂_{I'}(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}].

By Lemma 1 and the positivity assumption, the RHS is o_p(|I|). Note that the little-o terms are uniform in I and I'.
As |I| ≳ γ, we obtain E[I(A ∈ I){q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] = o_p(|I|), uniformly for any I and I'. By the positivity assumption, we have E[{q_I(X) − q_{I'}(X)}² | {O_i}_{i∈L_ℓ^c}] = o_p(1), uniformly for any I and I', or equivalently, E{q_I(X) − q_{I'}(X)}² = o(1), uniformly for any I and I'. The proof is hence completed.

D.5 PROOF OF THEOREM 3

In the first two steps, we compare the minimax convergence rates when Q ∈ Q_1. In the next two steps, we compare the minimax convergence rates when Q ∈ Q_2.

Step 1. We provide a lower bound for the minimax convergence rate of kernel-based OPE when Q ∈ Q_1. Consider the piecewise constant Q-function Q(a, x) = 0 if a ≤ 1/2, and 1 otherwise. Apparently, we have Q ∈ Q_1 when C_2 ≥ 1 ≥ ε. Define a policy π such that the density function of π(X) equals 4/3 if 1/4 ≤ π(x) ≤ 1/2, equals 2/3 else if 1/2 ≤ π(x) < 4/3, and equals 0 otherwise. We aim to show that for such Q and π, the best possible convergence rate of the kernel-based estimator is n^{-1/3}. We first consider its variance. Since the conditional variance of Y given (A, X) is uniformly bounded away from 0 and ∞, similar to Theorem 1 of Colangelo & Lee (2020), we can show that the variance is lower bounded by O(1)(nh)^{-1}, where O(1) denotes some positive constant. We next consider its bias. Since the behavior policy is known, the bias is equal to

E [ K{(A − π(X))/h} / {h b(A|X)} {Y − Q(π(X), X)} ] = E [ K{(A − π(X))/h} / {h b(A|X)} {Q(A, X) − Q(π(X), X)} ] = E ∫_{π(X)−h/2}^{π(X)+h/2} K({a − π(X)}/h) {I(π(X) ≤ 1/2 < a) − I(a ≤ 1/2 < π(X))} da.

Using the change of variable a = ht + π(X), the bias equals

E ∫_{−1/2}^{1/2} K(t) {I(π(X) ≤ 1/2 < π(X) + ht) − I(π(X) + ht ≤ 1/2 < π(X))} dt.

Consider any 0 < h ≤ ε̄ for some sufficiently small ε̄ > 0. The bias is then equal to

(4/3) ∫_{1/2−ε̄/2}^{1/2} ∫_{−1/2}^{1/2} K(t){I(a ≤ 1/2 < a + ht) − I(a + ht ≤ 1/2 < a)} dt da + (2/3) ∫_{1/2}^{1/2+ε̄/2} ∫_{−1/2}^{1/2} K(t){I(a ≤ 1/2 < a + ht) − I(a + ht ≤ 1/2 < a)} dt da.

Under the symmetry condition on the kernel function, the above quantity is equal to

(2/3) ∫_{1/2−h/2}^{1/2} ∫_{(1−2a)/(2h)}^{1/2} K(t) dt da ≥ (2/3) ∫_{1/2−h/4}^{1/2} ∫_{(1−2a)/(2h)}^{1/2} K(t) dt da ≥ (2/3) ∫_{1/2−h/4}^{1/2} ∫_{1/4}^{1/2} K(t) dt da = (h/6) ∫_{1/4}^{1/2} K(t) dt.

Consequently, the bias is lower bounded by O(1)h, where O(1) denotes some positive constant. To summarize, the root mean squared error of the kernel-based estimator is lower bounded by O(1){(nh)^{-1/2} + h}, where O(1) denotes some positive constant.
The optimal choice of h that minimizes this lower bound is of order n^{-1/3}. Consequently, the convergence rate is lower bounded by O(1) n^{-1/3}.

Step 2. We derive an upper bound for the minimax convergence rate of our estimator when Q ∈ Q_1. Using similar arguments to the proof of Lemma 1 (Section D.1) and the proof of Theorem 1 in Imaizumi & Fukumizu (2019), we can show that, with proper choices of the MLP networks, the following holds with probability at least 1 − O(n^{-C}) for some sufficiently large constant C > 0:

max_{I∈I(m), |I|≥cγ} E[|q_I(X) − q̂_I^(ℓ)(X)|² | {O_i}_{i∈L_ℓ^c}] ≤ C(n|I|)^{-2β/(2β+p)} log² n, (34)

where β is defined in Q_1 and Q_2. By equation 34 and the condition that γ ≳ n^{-2β/(2β+p)} log² n, using similar arguments to Steps 1 and 2 of the proof of Theorem 1, we can show that the estimated change point locations are consistent and D̂^(ℓ) = D_0 with probability 1 − O(n^{-C}). Similar to equation 27, it follows from equation 34 that Σ_{I∈D̂^(ℓ)} ∫_I E|Q(X, a) − q_I(X)|² da = O(n^{-2β/(2β+p)}), up to some logarithmic factors. This further implies that the change point locations in D̂^(ℓ) converge at a rate of O(n^{-2β/(2β+p)}), up to some logarithmic factors. We next establish the statistical properties of our value estimator. We first observe that, when b is known, our estimator is unbiased for the inverse-propensity-score weighted estimator

(1/n) Σ_{ℓ=1}^{L} Σ_{I∈D̂^(ℓ)} Σ_{i∈L_ℓ} I(A_i ∈ I) {I(π(X_i) ∈ I)/b(I|X_i)} Y_i. (35)

Consequently, its bias equals that of equation 35, and the variance of our value estimator is bounded by O_p(n^{-1}) as well. In addition, we note that this variance bound and the bias bound in equation 36 are uniform in Q ∈ Q_1 ∪ Q_2 and π ∈ Π. The proposed value estimator thus converges at a rate of O_p(n^{-2β/(2β+p)}), up to some logarithmic terms. When 2β/(2β + p) > 1/2, both the bias and the variance decay at a rate of n^{-1/2}, up to some logarithmic factors. This completes the proof of the first part.

Step 3.
We provide a lower bound for the minimax convergence rate of kernel-based OPE when Q ∈ Q_2 in this step. Similar to Step 1, we can show that its variance is lower bounded by O(n^{-1}h^{-1}). When h ≍ n^{-κ} for some κ > 3/5, the root mean squared error is therefore larger than or equal to O(n^{-(1-κ)/2}). Consider the Q-function

Q(x, a) = C h^{-1} K({a − π(x)}/h),

for some constant C > 0. With a proper choice of C, we can show that such a Q-function belongs to Q_2. Using similar arguments to Step 1, we can show that the bias equals

E [ C^{-1} K²{(A − π(X))/h} / {h² b(A|X)} ] ≥ C^{-1} E [ K²{(A − π(X))/h} / h² ].

Using similar arguments to Step 1, we can show that the right-hand side is lower bounded by O(1)h. When h ≍ n^{-κ} for some κ < 1/5, the root mean squared error is larger than or equal to O(n^{-κ}). To summarize, when h ≍ n^{-κ} for some κ < 1/5 or κ > 3/5, kernel-based OPE converges at a rate of n^{-κ*} for some κ* < 1/5.

Step 4. We provide an upper bound for the minimax convergence rate of our estimator when Q ∈ Q_2 in this step. By setting γ to be proportional to n^{-3/5} (this rate is achievable under the condition that 4β > 3p), we obtain the rate of n^{-1/5}.
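For comparison, the kernel-based IPW estimator analyzed in Steps 1 and 3 can be sketched as follows. This is a minimal sketch assuming a known behavior density b and a uniform kernel supported on [−1/2, 1/2]; the function name is ours, and practical implementations (e.g., Colangelo & Lee, 2020) additionally involve doubly robust corrections.

```python
import numpy as np

def kernel_ipw_value(x, a, y, pi, b_density, h):
    """Kernel-smoothed IPW value estimate
    V-hat = n^{-1} sum_i K((A_i - pi(X_i)) / h) Y_i / (h * b(A_i | X_i)),
    with the uniform kernel K(u) = 1{|u| <= 1/2}."""
    u = (a - pi(x)) / h
    k = (np.abs(u) <= 0.5).astype(float)
    return float(np.mean(k * y / (h * b_density(a, x))))
```

As the proof notes, the bandwidth h trades a variance of order (nh)^{-1} against a bias driven by how Q varies over a window of width h around π(x); near a jump of Q the bias is of order h, which is the source of the n^{-1/3} lower bound in Step 1.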



Figure 1: Left panel: an example of a piecewise constant function from the change point literature. Middle panel: the oracle Q-function on the feature-action space. Right panel: the green curve presents the oracle Q-function Q(x, π(x)) under the policy π(x) = x; the red curve is the fitted mean value by DJQE, and the pink dashed line corresponds to the 95% confidence bound.

Figure 2: The box plot of the estimated values of the optimal policy under the proposed DJQE and two kernel-based methods for Scenario 1-4.


Figure 4: The bias of the estimated value and the computational cost (in minutes) under the DJQE with different initial number of intervals (m) when n = 100 in Scenario 1.








|I | ≥ d n . Then according to Lemma 3, we can find some I such that a ∈ I ⊆ I , |I| → 0, E|q I (X) -q I (X)| 2 = o(1). Since |I| → 0 and a ∈ I, it follows from the uniform continuity of the Q-function that |q I (X) -Q(X, a)| → 0. The assertion thus follows. Next, suppose |I | < d n . Then |I | → 0 as well. It follows from the uniform continuity of the Q-function that inf a∈I |q I (X) -Q(X, a)| → 0. The assertion thus follows. This completes the proof for the first step. Step 2. Using similar arguments in Step 4 of the proof of Theorem 1, it suffices to show V (π D ( ) ) = V (π) + o p (1). By definition, V (π D ( ) ) -V (π) = I∈ D ( ) I EQ(X, a)I(π(X) ∈ I) b(a|X) b(I|X) da -EQ(π(X), X) = I∈ D ( ) I E{Q(X, a) -Q(π(X), X)}I(π(X) ∈ I) b(a|X) b(I|X) da.

There exists another interval I * ∈ I(m) that satisfies I = I * ∪ I. Notice that the partition D ( ) * = D ( ) ∪ {I * } ∪ I -{I } corresponds to another partition. By definition, we have 1

There exist two intervals I * , I * * ∈ I(m) that satisfy I = I * ∪ I ∪ I * * . Using similar arguments in proving equation 32, we can show that 1 |L |

|I|), uniformly for any I and I , or equivalently, E b(I|X) |I| {q I (X) -q I (X)} 2 |{O i } i∈L = o p (1).

X) -q I (X)}b(a|X)da.Since b is bounded away from zero and infinity, and that sup I Pr(π(X) ∈ I)/|I| ≤ C 4 , the absolute value of the bias is upper bounded byO, x) -q I (x)|da,where O(1) denotes some positive constant. For each I ∈ D ( ) , we could find some I 0 ∈ D 0 such that the lengths of I -I 0 and I 0 -I are upper bounded by O(n -2β/(2β+p) ) up to some logarithmic factors. By definition, we haveq I (x) = a∈I I0∈D0 b(a|x)I(a ∈ I 0 )q I0 (x)da b(I|x) .Consequently, sup x |q I (x) -q I0 (x)| = O(n -2β/(2β+p) ) up to some logarithmic factors. It follows that the absolute value of the bias of bias is upper bounded by O(|D ( ) |n -2β/(2β+p) ) up to some logarithmic factors. As |D ( ) | = |D 0 | ≤ C 1 with probability tending to 1, the absolute value of the bias is upper bounded byO p (n -2β/(2β+p) ),(36)up to some logarithmic factors.We next consider the variance. The variance is of the same order of magnitude of that of equation 35.Since L is finite, it suffices to consider the variance of n-1 | D ( ) |),where the last equality is due to that b is uniformly bounded way from zero and that Pr(π(X) ∈ I) ≤ O(1)|I| for some positive constant O(1). As |D ( ) | = |D 0 | ≤ C 1 with probability tending to 1, the variance of equation 37 is upper bounded by O p (n -1

Consider its bias first. Using similar arguments in Step 2, the bias is upper bounded by O(L -1 )L =1 I ∈ D ( ) a∈I sup x |Q(a, x) -q I (x)|da.Similar to Lemma 3, we can show for any interval I ∈ I(m) with |I| γ and any interval I ∈ D ( ) with I ⊆ I , we have w.p.a.1 thatE|q I (X) -q I (X)| 2 = Oγ is proportional to n -2β/(2β+p) up to some logarithmic factors. For any a, there exists an interval I whose length is proportional to γ log n that covers a. Since Q ∈ Q 2 , we have sup x |Q(a, x) -q I (x)| = O(|I|). This together with equation 38 yields thatI ∈ D ( ) a∈I sup x |Q(a, x) -q I (x)|da ≤ O(1) |I| + γ |I| . Set I = γ 1/3 ,the bias is upper bounded by O(1)γ 1/3 . Using similar arguments in Step 2 of the proof, the standard deviation is upper bounded by L =1 L -1 n -1 | D ( ) |. By equation 23, It is upper bounded by O(n -1/2 γ -1/2

The bias, the standard deviation, and the mean squared error of the estimated values under the optimal policy via the proposed DJQE and two kernel-based methods for the Warfarin data.

The averaged computational cost (in minutes) under the proposed DJQE and two kernel-based methods for Scenario 1.

The bias and the standard deviation (in parentheses) of the estimated values of the optimal policy under the proposed DJQE and two kernel-based methods for Scenario 1 to 4.

The averaged size of the final estimated partition (| D|) in comparison to the initial number of intervals (m) under the proposed DJQE for Scenario 1 to 4.

The mean squared error (MSE), the normalized root-mean-square deviation (NRMSD), the mean absolute error (MAE), and the normalized MAE (NMAE) of the fitted model under the MLP regressor, linear regression, and the random forest algorithm, via ten-fold cross-validation. Here MSE = n^{-1} Σ_{i=1}^n (Y_i − Ŷ_i)².

See https://en.wikipedia.org/wiki/Mean_squared_error.

