UNCERTAINTY-AWARE OFF-POLICY LEARNING

Abstract

Off-policy learning, the procedure of optimizing a policy with access only to logged feedback data, plays an important role in various real-world applications, such as search engines and recommender systems. Since the ground-truth logging policy that generated the logged data is usually unknown, previous work simply plugs its estimated value into off-policy learning, resulting in a biased estimator. This estimator suffers from both high bias and high variance on samples whose estimated logging probabilities are small and inaccurate. In this work, we explicitly model the uncertainty in the estimated logging policy and propose a novel Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.

1. INTRODUCTION

In many real-world applications, including search engines (Agarwal et al. (2019)), online advertising (Strehl et al. (2010)) and recommender systems (Chen et al. (2019); Liu et al. (2022)), only logged feedback data is available for subsequent policy optimization. For example, in recommender systems, complex recommendation models (i.e., policies) (Zhou et al. (2018); Guo et al. (2017)) are optimized with logged user interactions (e.g., clicks or staytime) on items recommended by previous recommendation policies. However, such logged data is known to be biased, since one does not observe the feedback on items that the previous policy (generally referred to as the logging policy) did not select. This inevitably distorts the evaluation and optimization of a new policy whenever it tends to select items that are not in the logged data.

Off-policy learning (Thrun & Littman (2000); Precup (2000)) emerges as a favorable way to learn an improved policy from the logged data alone, by addressing the mismatch between the learning policy and the logging policy. One of the most commonly used off-policy learning methods is Inverse Propensity Scoring (IPS) (Chen et al. (2019); Munos et al. (2016)), which assigns a per-sample importance weight to the training objective on the logged data, so as to obtain an optimization objective that is unbiased in expectation. The importance weight in IPS is the probability ratio between the learning policy and the logging policy. However, the ground-truth logging policy is unavailable to the learner, e.g., because it was not recorded in the data. One common treatment in previous work (Strehl et al. (2010); Liu et al. (2022); Chen et al. (2019); Ma et al. (2020)) is to first employ a supervised learning method (e.g., logistic regression or a neural network) to estimate the logging policy, and then use the estimated logging policy for off-policy learning. We theoretically show that such an approximation results in a biased estimator that is sensitive to inaccurate and small estimated logging probabilities. Worse still, small estimated logging probabilities usually correspond to fewer related samples in the logged data, so these estimates tend to have high uncertainty, i.e., they are inaccurate with high probability. Figure 1 shows a piece of empirical evidence from the large-scale recommendation benchmark KuaiRec (Gao et al. (2022)): items with lower frequencies in the logged dataset have lower estimated logging probabilities and higher uncertainties at the same time. The high bias and variance caused by these samples greatly hinder the performance of off-policy learning.

In this work, we explicitly take the uncertainty of the estimated logging policy into consideration and design a novel Uncertainty-aware Inverse Propensity Score estimator (UIPS) as the optimization objective for policy learning. UIPS introduces an additional weight to approach the ground-truth propensity from the estimated one, and learns an improved policy by alternating between two steps: (1) find the optimal weight that makes the estimator as accurate as possible, taking the uncertainty of the estimated logging policy into account; and (2) improve the policy by optimizing the resulting estimator. We further derive a closed-form solution for the optimal weight by minimizing an upper bound on the mean squared error (MSE) to the ground-truth policy value.
The optimal weight adjusts sample weights by considering both the uncertainty of the estimated logging probabilities and the propensity scores, rather than simply boosting or penalizing samples with highly uncertain logging probabilities. Experimental results on synthetic and three real-world recommendation datasets demonstrate the sample efficiency of UIPS. All data and code can be found in the supplementary materials for reproducibility.

To summarize, our contributions in this work are as follows:

• We point out that directly using the estimated logging policy leads to sub-optimal off-policy learning, since the resulting biased estimator is greatly distorted by samples with inaccurate and small estimated logging probabilities.
• We take the uncertainty of the estimated logging policy into consideration and propose UIPS for more accurate off-policy learning.
• Experiments on synthetic and three real-world recommendation datasets demonstrate UIPS's strong advantage over state-of-the-art methods.

2. PRELIMINARY: OFF-POLICY LEARNING

We focus on the standard contextual bandit setup to explain the key concepts. Following convention (Joachims et al. (2018); Saito & Joachims (2022); Su et al. (2020)), let $x \in \mathcal{X} \subseteq \mathbb{R}^d$ be a $d$-dimensional context vector drawn from an unknown probability distribution $p(x)$. Each context is associated with a finite set of actions denoted by $\mathcal{A}$, where $|\mathcal{A}| < \infty$. Let $\pi: \mathcal{A} \times \mathcal{X} \to [0, 1]$ denote a stochastic policy, such that $\pi(a|x)$ is the probability of selecting action $a$ under context $x$ and $\sum_{a \in \mathcal{A}} \pi(a|x) = 1$. Under a given context, the reward $r_{x,a}$ is observed when action $a$ is chosen. Take news recommendation as an example: $x$ represents the state of a user, summarizing his/her interaction history with the recommender system; each action $a$ is a candidate news article; the policy is a recommendation algorithm; and the reward $r_{x,a}$ denotes the user's feedback on article $a$, e.g., whether the user clicks the article. Let $V(\pi)$ denote the expected reward, or value, of the policy $\pi$:

$$V(\pi) = \mathbb{E}_{x \sim p(x),\, a \sim \pi(a|x)}[r_{x,a}]. \quad (1)$$

We look for a policy $\pi(a|x)$ that maximizes $V(\pi)$. In the rest of the paper, we denote $\mathbb{E}_{x \sim p(x),\, a \sim \pi(a|x)}[\cdot]$ as $\mathbb{E}_{\pi}[\cdot]$ for simplicity.

In contrast to performing online updates by following the learning policy $\pi(a|x)$, in off-policy learning we can only access a set of logged feedback data denoted by $\mathcal{D} := \{(x_n, a_n, r_{x_n,a_n}) \mid n \in [N]\}$, where $[N] := \{1, \dots, N\}$. Given $x_n$, the action $a_n$ was generated by a stochastic logging policy $\beta^*$, i.e., the probability that action $a_n$ was selected is $\beta^*(a_n|x_n)$. The actions $\{a_1, \dots, a_N\}$ and their corresponding rewards $\{r_{x_1,a_1}, \dots, r_{x_N,a_N}\}$ are generated independently given $\beta^*$. Due to the nature of policy optimization, the learning policy $\pi(a|x)$ is expected to differ from $\beta^*(a|x)$, unless $\beta^*(a|x)$ is already optimal. Moreover, in practice the situation can be further complicated. Again, consider the news recommendation scenario. Due to scalability requirements, industrial recommender systems usually adopt a two-stage framework (Ma et al. (2020)), where one or several candidate generation models first produce a candidate set and a separate ranking model then reranks the candidates to present the top-K items to users. While $\beta^*(a|x)$ depicts the whole two-stage process, the learning policy $\pi(a|x)$ is usually employed in one particular stage (e.g., the reranking stage), implying drastic differences between the logging and learning policies. The main challenge of off-policy learning is then to address the distribution discrepancy between $\beta^*(a|x)$ and $\pi(a|x)$, and to learn a policy $\pi(a|x)$ that maximizes $V(\pi)$ with access only to the logged dataset $\mathcal{D}$.

One of the most widely used methods to address the distribution shift between $\pi(a|x)$ and $\beta^*(a|x)$ is the Inverse Propensity Score (IPS) (Chen et al. (2019); Munos et al. (2016)). One can easily verify that $V(\pi) = \mathbb{E}_{\beta^*}\big[\frac{\pi(a|x)}{\beta^*(a|x)} r_{x,a}\big]$, yielding the following empirical estimator of $V(\pi)$:

$$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(a_n|x_n)}{\beta^*(a_n|x_n)} r_{x_n,a_n},$$

where $\pi(a_n|x_n)/\beta^*(a_n|x_n)$ is referred to as the propensity score. In the rest of the paper, without further specification, we use the empirical estimate of each expectation in our practical calculations. Various algorithms can readily be used for policy optimization under $\hat{V}_{\mathrm{IPS}}(\pi)$, including value-based methods (Silver et al. (2016)) and policy-based methods (Levine & Koltun (2013); Schulman et al. (2015); Williams (1992)). In this work, we adopt the well-known policy gradient algorithm REINFORCE (Williams (1992)).
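As a concrete reference point, the empirical estimator $\hat{V}_{\mathrm{IPS}}$ above can be computed in a few lines. The following numpy sketch is our own illustration with toy data, not part of the paper's implementation.

```python
import numpy as np

def ips_value(pi_probs, beta_probs, rewards):
    """Empirical IPS estimate of V(pi) from logged data.

    pi_probs[n]   = pi(a_n | x_n)      learning policy, evaluated on logged actions
    beta_probs[n] = beta*(a_n | x_n)   logging policy probabilities
    rewards[n]    = r_{x_n, a_n}
    """
    weights = pi_probs / beta_probs          # propensity scores
    return np.mean(weights * rewards)        # (1/N) * sum_n w_n * r_n

# toy usage: 5 logged interactions
pi_probs = np.array([0.30, 0.10, 0.25, 0.05, 0.40])
beta_probs = np.array([0.20, 0.30, 0.25, 0.10, 0.50])
rewards = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
print(ips_value(pi_probs, beta_probs, rewards))
```

The same per-sample weights reappear in the policy gradient derived next.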
Assume the policy $\pi(a|x)$ is parameterized by $\vartheta$. Via the "log-trick", the gradient of $\hat{V}_{\mathrm{IPS}}(\pi_\vartheta)$ with respect to $\vartheta$ can be readily derived as:

$$\nabla_\vartheta \hat{V}_{\mathrm{IPS}}(\pi_\vartheta) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\vartheta(a_n|x_n)}{\beta^*(a_n|x_n)} r_{x_n,a_n} \nabla_\vartheta \log \pi_\vartheta(a_n|x_n).$$

Approximation with unknown logging policy. In many real-world applications, the ground-truth logging policy, i.e., $\beta^*(a|x)$ for each observation $(x, a)$, is unknown. One reason is the legacy issue: the probabilities were simply not logged when the data was collected. Another reason is that the exact value of $\beta^*(a|x)$ is intrinsically unavailable, such as in two-stage recommender systems. As a solution, previous work employs supervised learning methods (e.g., logistic regression (Schnabel et al. (2016)) or neural networks (Chen et al. (2019))) to estimate the logging policy, and replaces $\beta^*(a|x)$ with its estimated value $\hat{\beta}(a|x)$ to obtain the following estimator for policy learning:

$$\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\vartheta(a_n|x_n)}{\hat{\beta}(a_n|x_n)} r_{x_n,a_n}.$$

However, as shown in the following proposition, an inaccurate $\hat{\beta}(a|x)$ leads to high bias and variance of $\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)$; worse still, a small and inaccurate $\hat{\beta}(a|x)$ further enlarges this bias and variance.

Proposition 1. The bias and variance of $\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)$ are:

$$\mathrm{Bias}\big(\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big] - V(\pi_\vartheta) = \mathbb{E}_{\pi_\vartheta}\Big[r_{x,a}\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} - 1\Big)\Big],$$

$$N \cdot \mathrm{Var}_{\mathcal{D}}\big(\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathrm{Var}_{\pi_\vartheta}\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big) + \mathbb{E}_{\pi_\vartheta}\Big[\Big(\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} - 1\Big) \cdot \frac{\beta^*(a|x)^2}{\hat{\beta}(a|x)^2} r_{x,a}^2\Big].$$

A smaller $\hat{\beta}(a|x)$ usually implies fewer related training samples in the logged data, and thus $\hat{\beta}(a|x)$ is inaccurate with higher probability. To make this explicit, we take the KuaiRec dataset (Gao et al. (2022)) as an example and estimate the logging policy following Chen et al. (2019). Figure 1 shows the estimated $\hat{\beta}(a|x)$ and its corresponding uncertainty for items with different observation frequencies in the logged dataset. Since uncertainty measures how wide the confidence interval around the current estimate is, higher uncertainty implies that the true value may be far from the empirical estimate with high probability. We defer the discussion of our detailed uncertainty calculation to Section 3. We can observe from Figure 1 that as item frequency decreases, the estimated logging probability also decreases while the estimation uncertainty increases. This implies that a smaller $\hat{\beta}(a|x)$ is usually 1) more inaccurate and 2) associated with higher uncertainty. As a result, given the high bias and variance caused by an inaccurate $\hat{\beta}(a|x)$, it is erroneous to improve $\pi_\vartheta(a|x)$ by simply optimizing $\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)$. We propose uncertainty-aware off-policy learning to address this challenge.
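In practice the REINFORCE gradient of $\hat{V}_{\mathrm{BIPS}}$ is obtained by automatic differentiation of a weighted log-likelihood. The PyTorch sketch below is our own illustration with a simple linear-softmax policy (not the authors' code): because the propensity weight is detached from the computation graph, minimizing the surrogate loss reproduces the gradient above with $\hat{\beta}$ in place of $\beta^*$.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a|x) = softmax(x W)_a over a finite action set."""
    def __init__(self, d, n_actions):
        super().__init__()
        self.linear = nn.Linear(d, n_actions, bias=False)

    def log_prob(self, x, a):
        logits = self.linear(x)                                    # (batch, |A|)
        return torch.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)

def bips_surrogate_loss(policy, x, a, r, beta_hat):
    """Negative of V_BIPS, written so autograd reproduces its REINFORCE gradient."""
    log_pi = policy.log_prob(x, a)
    w = (log_pi.exp() / beta_hat).detach()                          # propensity treated as a constant weight
    return -(w * r * log_pi).mean()

# toy usage
policy = SoftmaxPolicy(d=8, n_actions=20)
x = torch.randn(32, 8)
a = torch.randint(0, 20, (32,))
r = torch.rand(32)
beta_hat = torch.rand(32).clamp(min=0.05)
loss = bips_surrogate_loss(policy, x, a, r, beta_hat)
loss.backward()
```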

3. UNCERTAINTY-AWARE OFF-POLICY LEARNING

Our idea is to incorporate the uncertainty of the logging-policy estimate into policy learning. Observing that

$$V(\pi_\vartheta) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \cdot \frac{\hat{\beta}(a|x)}{\beta^*(a|x)} \cdot r_{x,a}\Big],$$

we propose to learn the optimal policy by optimizing the following empirical estimator:

$$\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\vartheta(a_n|x_n)}{\hat{\beta}(a_n|x_n)} \cdot \phi_{x_n,a_n} \cdot r_{x_n,a_n}, \quad (5)$$

where $\phi_{x_n,a_n}$ is a weight, reflecting $\hat{\beta}(a_n|x_n)/\beta^*(a_n|x_n)$, to be selected so that $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$ is as close to $V(\pi_\vartheta)$ as possible. Intuitively, one should give small weights to samples whose $\hat{\beta}(a|x)$ is far below the ground-truth $\beta^*(a|x)$. We therefore divide offline policy improvement into two steps, and repeat them until a convergence condition is met:
• Uncertainty-aware policy evaluation: derive the optimal uncertainty-aware $\phi_{x,a}$ that makes $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$ as accurate as possible.
• Policy improvement: learn an improved policy $\pi_\vartheta(a|x)$ by optimizing $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$.

3.1. Uncertainty-Aware Policy Evaluation

Optimal uncertainty-aware weight $\phi_{x,a}$. We measure the accuracy of $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$ by its mean squared error (MSE) with respect to $V(\pi_\vartheta)$, following previous work (Su et al. (2020); Saito & Joachims (2022)). MSE captures both the bias and the variance of an estimator, since it is the sum of the squared bias and the variance. We then look for the $\phi_{x,a}$ that minimizes the MSE. In particular, we show that the optimal $\phi_{x,a}$ has a closed-form expression that depends on both the value of $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$ and the estimation uncertainty of $\hat{\beta}(a|x)$. More specifically, instead of directly minimizing the MSE, which is intractable, we find the desirable $\phi_{x,a}$ by minimizing an upper bound of the MSE given in the following theorem.

Theorem 1. Assume $r_{x,a} \in [0,1]$. The mean squared error (MSE) between $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$ and the ground-truth value $V(\pi_\vartheta)$ is upper bounded as follows:

$$\mathrm{MSE}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\Big[\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big)^2\Big] = \mathrm{Bias}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big)^2 + \mathrm{Var}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big)$$
$$\leq \mathbb{E}_{\pi_\vartheta}\Big[r_{x,a}^2 \frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\Big] \cdot \mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2\Big].$$

The upper bound in Theorem 1 strictly increases with the two expectations that involve $\phi_{x,a}$, which implies that for some choice of $\lambda \in [0, \infty]$, the MSE-minimizing $\phi_{x,a}$ can be derived by minimizing

$$\lambda \mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2\Big]. \quad (6)$$

We cannot directly minimize Eq. (6) since the unknown $\beta^*(a|x)$ is involved. However, various methods (Gal & Ghahramani (2016); Xu et al. (2021)) can be employed to obtain a confidence interval that contains $\beta^*(a|x)$ with high probability. More specifically, following previous work (Joachims et al. (2018)), we assume $\beta^*(a|x)$ can be modelled by a softmax function on top of an unknown function $f_{\theta^*}(x,a)$, i.e., the realizability assumption. Then

$$\beta^*(a|x) = \frac{\exp(f_{\theta^*}(x,a))}{\sum_{a'} \exp(f_{\theta^*}(x,a'))}, \qquad \hat{\beta}(a|x) = \frac{\exp(f_{\hat{\theta}}(x,a))}{\sum_{a'} \exp(f_{\hat{\theta}}(x,a'))}, \quad (7)$$

where $f_{\hat{\theta}}(x,a)$ is an estimate of $f_{\theta^*}(x,a)$. Following the conventional definition of a confidence interval (Abbasi-Yadkori et al. (2011)), we define $\gamma$ and $U_{x,a}$ such that $|f_{\theta^*}(x,a) - f_{\hat{\theta}}(x,a)| \leq \gamma U_{x,a}$ holds with probability at least $1-\delta$, where $\gamma$ is a function of $\delta$ (typically, the smaller $\delta$ is, the larger $\gamma$ is). Then $\gamma U_{x,a}$ measures the width of the confidence interval of $f_{\hat{\theta}}(x,a)$ around its ground truth $f_{\theta^*}(x,a)$. This implies that $\beta^*(a|x) \in \mathcal{B}_{x,a}$ with probability at least $1-\delta$, where

$$\mathcal{B}_{x,a} = \Big[\frac{\hat{Z} \exp(-\gamma U_{x,a})}{Z^*} \hat{\beta}(a|x),\; \frac{\hat{Z} \exp(\gamma U_{x,a})}{Z^*} \hat{\beta}(a|x)\Big], \quad Z^* = \sum_{a'} \exp(f_{\theta^*}(x,a')), \quad \hat{Z} = \sum_{a'} \exp(f_{\hat{\theta}}(x,a')).$$

Since $\beta^*(a|x)$ can be any value in $\mathcal{B}_{x,a}$, we adopt the idea of robust optimization (Chen et al. (2020)) and find the optimal $\phi_{x,a}$ by solving the following optimization problem:

$$\min_{\phi_{x,a}} \max_{\beta_{x,a} \in \mathcal{B}_{x,a}} \; \lambda \mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta_{x,a}}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2\Big]. \quad (8)$$

The following theorem gives a closed-form formula for the optimal solution of Eq. (8).

Theorem 2. Let $\eta_1, \eta_2 \in [\exp(-\gamma U_x^{\max}), \exp(\gamma U_x^{\max})]$, where $U_x^{\max} = \max_a U_{x,a}$. The optimization problem in Eq. (8) has the closed-form solution

$$\phi^*_{x,a} = \min\Bigg\{ \lambda \Big/ \Big[\frac{\lambda}{\eta_1} \exp(-\gamma U_{x,a}) + \frac{\eta_1 \pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2 \exp(-\gamma U_{x,a})}\Big],\; \frac{2\eta_2}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})} \Bigg\}.$$

Insights on $\phi^*_{x,a}$. The second term of $\phi^*_{x,a}$, i.e., $2\eta_2/[\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})]$, acts as a capping threshold ensuring that $\phi^*_{x,a} \leq 2\eta_2$ holds even when $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$ is small, as shown in Lemma 1 in Appendix A.4.
The key component is the first term, and Lemma 1 implies the following:
• If the propensity score $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$ is above the threshold $\sqrt{\lambda}/\eta_1$, UIPS assigns a smaller weight to a sample with a more inaccurate $\hat{\beta}(a|x)$, which prevents distortion from a large propensity score paired with an inaccurate logging probability.
• If the propensity score $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$ is below the threshold but not small enough to activate the second term, then the worst-case propensity score (i.e., taking $\mathcal{B}^-_{x,a} = \hat{\beta}(a|x)\hat{Z}\exp(-\gamma U_{x,a})/Z^*$ as the denominator) matters. If the worst-case propensity score is under control, i.e., $\pi_\vartheta(a|x)/\mathcal{B}^-_{x,a} < \sqrt{\lambda}$, a larger $U_{x,a}$ implies a smaller propensity score $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$, and UIPS tends to boost this safe sample with a higher $\phi^*_{x,a}$. Otherwise, $\phi^*_{x,a}$ still decreases as $U_{x,a}$ grows.

Uncertainty estimation. We now describe how to calculate $U_{x,a}$, i.e., the uncertainty of the estimated $\hat{\beta}(a|x)$. In this work, we choose to estimate $\beta^*(a|x)$ with a neural network, due to its strong representation learning capacity, and various methods (Gal & Ghahramani (2016); Xu et al. (2021)) can be leveraged to perform uncertainty estimation for a neural network. For example, Gal & Ghahramani (2016) estimate uncertainty using dropout, and Xu et al. (2021) provide a theoretical bound. Here we adopt the result of Xu et al. (2021) due to its computational efficiency and theoretical soundness. Following the proof of Theorem 4.4 in Xu et al. (2021), given the logged dataset $\mathcal{D}$, there exists $\gamma$ such that, with high probability,

$$|f_{\hat{\theta}}(x_n, a_n) - f_{\theta^*}(x_n, a_n)| \leq \gamma\, g(x_n, a_n)^T M_{\mathcal{D}}^{-1} g(x_n, a_n),$$

where $g(x_n, a_n)$ is the gradient of $f_{\hat{\theta}}(x_n, a_n)$ with respect to its last layer, i.e., $g(x_n, a_n) = \nabla_{\theta_w} f_{\hat{\theta}}(x_n, a_n)$ with $\theta_w \subset \hat{\theta}$ the parameters of the last layer of $f_{\hat{\theta}}(x_n, a_n)$, and $M_{\mathcal{D}} = \sum_{n=1}^{N} g(x_n, a_n) g(x_n, a_n)^T$. This yields $U_{x_n,a_n} = g(x_n, a_n)^T M_{\mathcal{D}}^{-1} g(x_n, a_n)$.
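Both per-sample quantities above, $U_{x,a}$ and $\phi^*_{x,a}$, are straightforward to compute. The numpy sketch below is our own illustration, assuming the last-layer gradients are stacked as rows of a matrix G; the ridge term mirrors the identity initialization of $M_{\mathcal{D}}$ in Algorithm 1, and the names lam, gamma, eta1, eta2 stand for $\lambda, \gamma, \eta_1, \eta_2$.

```python
import numpy as np

def logging_policy_uncertainty(G, ridge=1.0):
    """U_{x_n,a_n} = g_n^T M_D^{-1} g_n for every logged sample.

    G: (N, d_w) array whose n-th row is g(x_n, a_n), the gradient of
       f_theta_hat(x_n, a_n) w.r.t. the last-layer parameters.
    """
    d_w = G.shape[1]
    M = G.T @ G + ridge * np.eye(d_w)              # M_D = sum_n g_n g_n^T (identity init)
    M_inv = np.linalg.inv(M)
    return np.einsum("nd,de,ne->n", G, M_inv, G)   # per-sample quadratic form

def uips_weight(pi_prob, beta_hat, U, lam, gamma, eta1, eta2):
    """Closed-form phi*_{x,a} from Theorem 2, evaluated element-wise."""
    exp_neg, exp_pos = np.exp(-gamma * U), np.exp(gamma * U)
    term1 = lam / ((lam / eta1) * exp_neg + eta1 * (pi_prob / beta_hat) ** 2 / exp_neg)
    term2 = 2.0 * eta2 / (exp_pos + exp_neg)       # capping threshold
    return np.minimum(term1, term2)

# toy usage
G = np.random.randn(1000, 16)
U = logging_policy_uncertainty(G)
phi = uips_weight(pi_prob=np.full(1000, 0.1), beta_hat=np.full(1000, 0.05),
                  U=U, lam=10.0, gamma=25.0, eta1=1.0, eta2=100.0)
```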

3.2. Policy Improvement

After obtaining the optimal $\phi^*_{x,a}$ as in Theorem 2, the policy $\pi_\vartheta(a|x)$ can be updated with the following REINFORCE gradient:

$$\nabla_\vartheta \hat{V}_{\mathrm{UIPS}}(\pi_\vartheta) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \cdot \phi^*_{x,a} \cdot r_{x,a}\, \nabla_\vartheta \log \pi_\vartheta(a|x)\Big]. \quad (9)$$

UIPS then iterates between policy evaluation and policy improvement until convergence. The overall algorithm and the important notations are summarized in Algorithm 1 and Table 6 in Appendix A.1, respectively.
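Putting the two steps together, the sketch below is our own illustration of one UIPS iteration (reusing the SoftmaxPolicy class from the Section 2 sketch; the hyperparameter defaults are placeholders): it recomputes $\phi^*$ for the batch and then takes a REINFORCE step matching Eq. (9).

```python
import torch

def uips_step(policy, optimizer, x, a, r, beta_hat, U,
              lam=10.0, gamma=25.0, eta1=1.0, eta2=100.0):
    """One UIPS iteration: uncertainty-aware evaluation, then policy improvement."""
    log_pi = policy.log_prob(x, a)                        # log pi_theta(a_n|x_n)
    pi = log_pi.exp().detach()
    exp_neg, exp_pos = torch.exp(-gamma * U), torch.exp(gamma * U)
    phi = torch.minimum(                                  # phi*_{x,a} from Theorem 2
        lam / ((lam / eta1) * exp_neg + eta1 * (pi / beta_hat) ** 2 / exp_neg),
        2.0 * eta2 / (exp_pos + exp_neg),
    )
    weight = (pi / beta_hat * phi).detach()               # uncertainty-aware propensity weight
    loss = -(weight * r * log_pi).mean()                  # gradient matches Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full run, this step would be wrapped in the while-loop of Algorithm 1, with U precomputed once before training.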

4. EMPIRICAL EVALUATION

In this section, we evaluate UIPS on both synthetic datasets and three real-world datasets with unbiased test data. We compare UIPS with the following baselines, which can be grouped into five categories:

• Cross-Entropy (CE): a supervised learning method with the cross-entropy loss as its objective, which is the common learning approach for a model with a softmax output. No off-policy correction is performed in this method.

• IPS-Cap (Chen et al. (2019)): standard IPS-based off-policy learning, which caps propensity scores to control variance, i.e., it takes $\min\big(c, \frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)}\big)$ as the propensity score. Setting $c$ to a small value reduces variance but introduces bias.

• MinVar & StableVar (Zhan et al. (2021)), Shrinkage (Su et al. (2020)): this line of work improves off-policy evaluation estimators by reweighing each sample. MinVar and StableVar reweigh each sample by $h_{x,a}/\sum_{a'} h_{x,a'}$, with $h_{x,a} = \hat{\beta}(a|x)/\pi_\vartheta(a|x)^2$ and $h_{x,a} = \sqrt{\hat{\beta}(a|x)}/\pi_\vartheta(a|x)$ respectively, since they find that $\pi_\vartheta(a|x)^2/\hat{\beta}(a|x)$ is directly related to the variance. Su et al. (2020) propose to shrink the propensity score by multiplying it with a weight $\lambda/\big(\lambda + \frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2}\big)$, which is a special case of the proposed UIPS with $U_{x,a} = 0$ and $\eta_1 = 1$ (see the sketch after this list). All of these works simply treat $\hat{\beta}(a|x)$ as $\beta^*(a|x)$, and none of them consider the accuracy or uncertainty of $\hat{\beta}(a|x)$.

• SNIPS (Swaminathan & Joachims (2015c)), BanditNet (Joachims et al. (2018)), POEM (Swaminathan & Joachims (2015b)), POXM (Lopez et al. (2021)), Adaptive (Liu et al. (2022)): this line of work aims for more stable and accurate policy learning. SNIPS normalizes the estimator by the sum of propensity scores in each batch. BanditNet extends SNIPS and leverages an additional Lagrangian term to normalize the estimator by an approximated sum of propensity scores over all samples. POEM jointly optimizes the estimator and its variance. POXM controls estimation variance by pruning samples with small logging probabilities. Adaptive proposes a new formulation to utilize negative samples.

• UIPS-P and UIPS-O: two variants of the proposed UIPS with different ways of leveraging uncertainties. UIPS-P directly penalizes samples with high estimation uncertainty, while UIPS-O adversarially uses the worst-case propensity scores $\pi_\vartheta(a|x)/\mathcal{B}^-_{x,a}$ for policy learning, i.e., $\phi_{x,a} = 1/\exp(-\gamma U_{x,a})$.
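As a quick sanity check on the special-case claim above, the snippet below is a toy numerical example of ours (not from the paper's code) that evaluates the UIPS weight of Theorem 2 at $U_{x,a}=0$ and $\eta_1=1$ and compares it to the shrinkage weight.

```python
import numpy as np

# With U_{x,a} = 0 both exp(+/- gamma*U) factors equal 1, so the first term of phi*
# becomes lambda / (lambda + (pi/beta)^2) -- the shrinkage weight of Su et al. (2020);
# the capping term equals eta2 and is inactive for a large eta2.
pi_prob, beta_hat, lam, eta1, eta2 = 0.3, 0.05, 10.0, 1.0, 100.0
term1 = lam / ((lam / eta1) * 1.0 + eta1 * (pi_prob / beta_hat) ** 2 / 1.0)
term2 = 2.0 * eta2 / (1.0 + 1.0)
phi_star = min(term1, term2)
shrinkage = lam / (lam + (pi_prob / beta_hat) ** 2)
assert np.isclose(phi_star, shrinkage)
```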

4.1. Synthetic Data

Data generation. Following previous work (Ma et al. (2020); Lopez et al. (2021)), we generate a synthetic dataset by a supervision-to-bandit conversion on the Wiki10-31K dataset (Bhatia et al. (2016)), an extreme multi-label classification dataset. Wiki10-31K contains approximately 20K samples. Each sample is associated with a 101,938-dimensional feature vector $\bar{x}$ and a label vector over 31K classes with more than one positive class. Let $y_{x,a}$ denote the label of class $a$ under $x$, and we take each class as an action. We adopt the Wiki10-31K dataset rather than datasets from the UCI machine learning repository (Swaminathan & Joachims (2015a)), since the problem is much harder with such a large action space. We split the dataset into train, validation and test sets of sizes 11K:3K:6K; the test set is taken from the official split. Since the original feature vector $\bar{x}$ is too sparse, for ease of learning we first embed it into dimension $d$ via $x = W\bar{x}$, and synthesize the ground-truth logging policy $\beta^*(a|x)$ by

$$\beta^*(a|x) = \frac{\exp(x^T \theta^*_a / \tau)}{\sum_{a'} \exp(x^T \theta^*_{a'} / \tau)},$$

where $W$ and $\{\theta^*_a\}$ are parameters pre-learned by applying a logistic regression model to the train set, and $\tau$ is a hyper-parameter that controls the skewness of the logging distribution: a small $\tau$ leads to a near-deterministic policy, while a larger $\tau$ makes the logging policy smoother. Due to the space limit, more details on data generation and implementation can be found in Appendix A.2.

Evaluation metrics. To evaluate the learned policy $\pi_\vartheta(a|x)$, we calculate Precision@K (P@K), Recall@K (R@K) and NDCG@K as in previous work (Lopez et al. (2021); Ma et al. (2020)). Higher P@K, R@K and NDCG@K imply a better policy.

Table 1 shows the mean performance and standard deviations of all algorithms under 10 random seeds on three synthetic datasets generated with different $\tau$. Since the ground-truth logging policy is accessible on the synthetic datasets, we include an additional baseline IPS-GT, which depicts the performance the IPS estimator can achieve when the ground-truth logging probabilities are known and the sample size is sufficiently large. We calculate the p-value under a t-test between UIPS and the best baseline on each dataset to investigate the significance of the improvement.
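The logging-policy synthesis above is easy to reproduce. The following numpy sketch is our own illustration of the conversion, assuming the pre-learned embedding W, the per-action parameters (stacked as Theta_star) and the binary label matrix Y are given.

```python
import numpy as np

def synth_logging_policy(X, Theta_star, tau):
    """beta*(a|x) = softmax(x^T theta*_a / tau) for every embedded context in X."""
    logits = X @ Theta_star / tau                       # (n_samples, n_actions)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def supervision_to_bandit(X, Y, Theta_star, tau, rng):
    """Sample one logged interaction (x_n, a_n, r_n) per context; r_n = y_{x_n, a_n}."""
    beta = synth_logging_policy(X, Theta_star, tau)
    actions = np.array([rng.choice(beta.shape[1], p=row) for row in beta])
    rewards = Y[np.arange(len(actions)), actions]
    return actions, rewards
```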
Table 1: Experimental results on synthetic datasets. The best and second best results are highlighted with bold and underline respectively. The p-value under the t-test between UIPS and the best baseline on each dataset is also provided.

| Algorithm | P@5 (τ=0.5) | R@5 (τ=0.5) | NDCG@5 (τ=0.5) | P@5 (τ=1) | R@5 (τ=1) | NDCG@5 (τ=1) | P@5 (τ=2) | R@5 (τ=2) | NDCG@5 (τ=2) |
|---|---|---|---|---|---|---|---|---|---|
| IPS-GT | 0.5589±1e-3 | 0.1582±6e-4 | 0.6093±1e-3 | 0.5526±2e-3 | 0.1565±6e-4 | 0.6007±1e-3 | 0.5531±2e-3 | 0.1557±7e-4 | 0.6037±1e-3 |
| CE | 0.5553±6e-4 | 0.1573±2e-4 | 0.6037±5e-4 | 0.5510±6e-4 | 0.1561±2e-4 | 0.5995±4e-4 | 0.5386±2e-3 | 0.1524±7e-4 | 0.5874±2e-3 |
| IPS-Cap | 0.5515±2e-3 | 0.1553±8e-4 | 0.6031±2e-3 | 0.5526±2e-3 | 0.1561±6e-4 | 0.6016±1e-3 | 0.5409±3e-3 | 0.1529±9e-4 | 0.5901±2e-3 |
| MinVar | 0.5340±2e-3 | 0.1509±6e-4 | 0.5857±2e-3 | 0.5282±2e-3 | 0.1491±7e-4 | 0.5791±2e-3 | 0.5036±4e-3 | 0.1415±1e-3 | 0.5543±3e-3 |
| StableVar | 0.4577±5e-3 | 0.1310±1e-3 | 0.5111±2e-3 | 0.5373±3e-3 | 0.1523±9e-4 | 0.5866±3e-3 | 0.5279±3e-3 | 0.1492±8e-4 | 0.5781±3e-3 |
| Shrinkage | 0.5526±2e-3 | 0.1562±7e-4 | 0.6024±1e-3 | 0.5499±4e-3 | 0.1545±1e-3 | 0.6040±3e-3 | 0.5347±2e-3 | 0.1513±6e-4 | 0.5824±2e-3 |
| SNIPS | 0.2616±6e-2 | 0.0749±2e-2 | 0.3150±7e-2 | 0.3538±5e-2 | 0.0987±1e-2 | 0.4144±6e-2 | 0.4379±3e-2 | 0.1226±9e-3 | 0.5177±3e-2 |
| BanditNet | 0.4011±3e-2 | 0.1131±8e-3 | 0.4830±2e-2 | 0.3894±4e-2 | 0.1095±1e-2 | 0.4741±3e-2 | 0.4122±3e-2 | 0.1153±8e-3 | 0.4934±3e-2 |
| POEM | 0.5480±2e-3 | 0.1539±8e-4 | 0.6008±2e-3 | 0.5502±2e-3 | 0.1551±6e-4 | 0.6000±2e-3 | 0.5399±2e-3 | 0.1526±8e-4 | 0.5893±2e-3 |
| POXM | 0.4006±3e-2 | 0.1130±8e-3 | 0.4828±2e-2 | 0.3616±4e-2 | 0.1019±1e-2 | 0.4522±4e-2 | 0.3816±4e-2 | 0.1069±1e-2 | 0.4680±4e-2 |
| Adaptive | 0.3831±2e-2 | 0.1050±4e-3 | 0.4382±2e-2 | 0.4734±4e-3 | 0.1325±1e-3 | 0.5326±3e-3 | 0.3936±1e-2 | 0.1097±4e-3 | 0.4368±2e-2 |
| UIPS-P | 0.4019±3e-2 | 0.1131±1e-2 | 0.4831±3e-2 | 0.3904±4e-2 | 0.1096±1e-2 | 0.4749±3e-2 | 0.4109±3e-2 | 0.1149±1e-2 | 0.4922±3e-2 |
| UIPS-O | 0.4135±4e-2 | 0.1167±1e-2 | 0.4954±4e-2 | 0.3896±4e-2 | 0.1096±1e-2 | 0.4739±3e-2 | 0.4519±3e-2 | 0.1268±8e-3 | 0.5296±2e-2 |
| UIPS | 0.5608±2e-3 | 0.1589±8e-4 | 0.6113±3e-3 | 0.5572±2e-3 | 0.1571±8e-4 | 0.6074±2e-3 | 0.5432±3e-3 | 0.1534±8e-4 | 0.5946±2e-3 |
| p-value | 4e-6 | 4e-5 | 2e-10 | 2e-7 | 2e-3 | 4e-10 | 1e-1 | 2e-1 | 4e-2 |

First, we can observe that UIPS achieves similar or even better performance than IPS-GT when τ = 0.5 and τ = 1, but performs worse than IPS-GT on the dataset with τ = 2. Although IPS-GT has access to the ground-truth logging probabilities, it still suffers from the high variance caused by samples with small logging probabilities, which is the main cause of its weaker performance when τ = 0.5 and τ = 1. When the ground-truth logging policy is smoother (e.g., τ = 2), the variance of the IPS estimator becomes much smaller, and off-policy correction with the ground-truth logging probabilities, rather than the estimated ones, leads to better model performance. We can then observe that as τ increases, i.e., the probability of selecting positive actions decreases, the performance of most algorithms drops, including CE, IPS-Cap, UIPS, Shrinkage, POEM, Adaptive, etc. Nevertheless, UIPS still achieves the best performance on all three datasets under all three metrics, and as τ decreases, the improvement of UIPS becomes larger and more significant. SNIPS, BanditNet and POXM are more robust to small logging probabilities of positive actions.
UIPS consistently outperforms Shrinkage (a special case of UIPS with the uncertainty always set to zero) on all three datasets, demonstrating the benefit of considering the estimation uncertainty. Finally, blindly reweighing through uncertainties, regardless of the scale of the propensity scores, also leads to poor performance, as shown by UIPS-P and UIPS-O.

Performance under different uncertainty levels. As shown in Figure 1, low-frequency actions in the logged dataset suffer from higher uncertainties in their propensity estimation. We therefore divide the test set into two subsets according to the average frequency of the associated actions, where the uncertainty in the subset associated with low-frequency actions is on average 9% higher than that in the subset associated with high-frequency actions. Table 2 shows the results on these two subsets when τ = 0.5; we only report the results of the best three baselines due to the space limit. One can clearly observe that only UIPS performs better than CE on the test subset associated with low-frequency actions, implying the advantage of UIPS in dealing with inaccurately estimated logging probabilities.

Table 2: Performance under different uncertainty levels. Columns report P@5(RI), R@5(RI) and NDCG@5(RI) on the subset associated with low-frequency actions (high uncertainty) and the subset associated with high-frequency actions (low uncertainty).

Table 5: Experimental results on real-world datasets. The best and second best results are highlighted with bold and underline respectively. The p-value under the t-test between UIPS and the best baseline on each dataset is also provided.

| Algorithm | Yahoo P@5 | Yahoo R@5 | Yahoo NDCG@5 | Coat P@5 | Coat R@5 | Coat NDCG@5 | KuaiRec P@50 | KuaiRec R@50 | KuaiRec NDCG@50 |
|---|---|---|---|---|---|---|---|---|---|
| CE | 0.2819±2e-3 | 0.7594±6e-3 | 0.6073±7e-3 | 0.2799±5e-3 | 0.4618 | 0.4529±7e-3 | 0.8802±2e-3 | 0.0240±8e-5 | 0.8810±6e-3 |
| IPS-Cap | 0.2751±2e-3 | 0.7419±8e-3 | 0.5928±7e-3 | 0.2758±6e-3 | 0.4582±7e-3 | 0.4399±9e-3 | 0.8750±3e-3 | 0.0238±7e-5 | 0.8788±5e-3 |
| MinVar | 0.2843±4e-3 | 0.7685±1e-2 | 0.6168±1e-2 | 0.2813±3e-3 | 0.4668±9e-3 | 0.4414±8e-3 | 0.8827±1e-3 | 0.0240±5e-5 | 0.8886±2e-3 |
| StableVar | 0.2787±2e-3 | 0.7499±7e-3 | 0.5919±7e-3 | 0.2840±3e-3 | 0.4662±5e-3 | 0.4393±7e-3 | 0.8524±7e-3 | 0.0231±2e-4 | 0.8570±4e-3 |
| Shrinkage | 0.2843±3e-3 | 0.7654±8e-3 | 0.6204±7e-3 | 0.2790±5e-3 | 0.4636±4e-3 | 0.4464±1e-2 | 0.8744±3e-3 | 0.0238±9e-5 | 0.8771±6e-3 |
| SNIPS | 0.2222±4e-3 | 0.5828±1e-2 | 0.4357±1e-2 | 0.2643±7e-3 | 0.4287±1e-2 | 0.4009±9e-3 | 0.8411±6e-3 | 0.0228±2e-4 | 0.8431±6e-3 |
| BanditNet | 0.2413±8e-3 | 0.6442±2e-2 | 0.4988±2e-2 | 0.2781±8e-3 | 0.4527±1e-2 | 0.4251±1e-2 | 0.8758±5e-3 | 0.0239±2e-4 | 0.8810±4e-3 |
| POEM | 0.2732±3e-3 | 0.7357±1e-2 | 0.5880±1e-2 | 0.2791±4e-3 | 0.4566±6e-3 | 0.4375±6e-3 | 0.7785±1e-2 | 0.0210±2e-4 | 0.7779±6e-3 |
| POXM | 0.2250±5e-3 | 0.5940±1e-2 | 0.4542±2e-2 | 0.2663±6e-3 | 0.4308±9e-3 | 0.4006±1e-2 | 0.8962±1e-2 | 0.0245±4e-4 | 0.9041±1e-2 |
| Adaptive | 0.2762±3e-3 | 0.7451±9e-3 | 0.5919±8e-3 | 0.2830±3e-3 | 0.4634±5e-3 | 0.4217±5e-3 | 0.8375±1e-2 | 0.0227±4e-4 | 0.8460±1e-2 |
| UIPS-P | 0.1829±8e-3 | 0.4560±3e-2 | 0.3300±1e-2 | 0.2685±7e-3 | 0.4364±9e-3 | 0.4087±7e-3 | 0.8638±8e-3 | 0.0235±3e-4 | 0.8685±7e-3 |
| UIPS-O | 0.1947±3e-3 | 0.4959±1e-2 | 0.3600±8e-3 | 0.2657±5e-3 | 0.4306±9e-3 | 0.4146±9e-3 | 0.8651±8e-3 | 0.0235±2e-4 | 0.8697±7e-3 |
| UIPS | 0.2868±2e-3 | 0.7742±5e-3 | 0.6274±5e-3 | 0.2877±3e-3 | 0.4757±5e-3 | 0.4576±8e-3 | 0.9120±1e-3 | 0.0250±5e-5 | 0.9174±7e-4 |
| p-value | 4e-2 | 1e-2 | 3e-2 | 2e-2 | 6e-4 | 5e-5 | 6e-4 | 6e-4 | 1e-3 |

Ablation Study. In this experiment, we aim to answer two questions: (1) can $\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)$ in Eq. (5) lead to more accurate off-policy evaluation? (2) how does UIPS perform under different hyperparameters? Due to the space limit, we report results on the synthetic dataset with τ = 0.5.
To answer the first question, we evaluate the following ε-greedy policy: $\pi(a|x) = \frac{1-\epsilon}{|M_x|} \mathbb{I}\{a \in M_x\} + \epsilon/|\mathcal{A}|$, where $M_x$ contains all positive actions associated with feature vector $x$. For each $x$ in the test set, we then sample 1K data points in a similar way as discussed previously to calculate the value of each estimator. Table 4 shows the MSE of the estimators with respect to the ground-truth policy value under 20 different random seeds. We only compare with baselines that modify the off-policy evaluation estimator, i.e., IPS-Cap, MinVar, StableVar and Shrinkage. One can observe from Table 4 that UIPS yields the smallest MSE, implying the most accurate off-policy evaluation.

For the second question, $\gamma$ and $\eta_1^2/\lambda$ are the two most important hyperparameters, as discussed in Appendix A.2.1. We therefore fix $\eta_1$ and $\eta_2$, and vary $\lambda$ and $\gamma$ to track the performance of UIPS. Recall that a larger $\gamma$ implies a higher chance that the derived interval contains $\beta^*(a|x)$, while $\sqrt{\lambda}/\eta_1$ is closely related to how UIPS works, as discussed in "Insights on $\phi^*_{x,a}$" in Section 3.1. Figure 3 reports NDCG@5 under different $\gamma$ and $\lambda$; results on P@5 and R@5 can be found in Appendix A.2.1. We observe that for UIPS to perform well, $\mathcal{B}_{x,a}$ needs to be of high confidence, e.g., $\gamma = 25$ performs best when τ = 0.5. Moreover, the threshold $\sqrt{\lambda}/\eta_1$ can be neither too small nor too large.
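For reference, the ε-greedy evaluation policy used in this ablation can be constructed directly from its definition. The sketch below is our own, with M_x passed as an index array of the positive actions.

```python
import numpy as np

def eps_greedy_policy(positive_actions, n_actions, eps):
    """pi(a|x) = (1-eps)/|M_x| * I{a in M_x} + eps/|A|."""
    pi = np.full(n_actions, eps / n_actions)
    pi[positive_actions] += (1.0 - eps) / len(positive_actions)
    return pi

# toy usage: 5 positive actions out of 1000 candidates, eps = 0.1
pi = eps_greedy_policy(np.array([3, 17, 256, 511, 900]), n_actions=1000, eps=0.1)
assert abs(pi.sum() - 1.0) < 1e-9
```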

4.2. Real-World Data

Off-policy learning has its utility in recommendation scenarios (Chen et al. (2019); Ma et al. (2020)), where the context vector $x$ denotes the state of a user and each candidate item is taken as an action. To further demonstrate the effectiveness of UIPS in real-world scenarios, we evaluate it on three recommendation datasets with unbiased test data: (1) Yahoo!R3 (https://webscope.sandbox.yahoo.com/); (2) Coat (https://www.cs.cornell.edu/~schnabts/mnar/); and (3) KuaiRec (Gao et al. (2022)), from music, fashion and micro-video recommendation scenarios respectively. All three datasets contain an unbiased test set collected from a randomized controlled trial in which items are randomly selected. The statistics of the three datasets and implementation details, e.g., model architectures and dataset splits, can be found in Appendix A.2.2. We again adopt P@K, R@K and NDCG@K as evaluation metrics. Following Ding et al. (2022), we take K = 5 on the Yahoo!R3 and Coat datasets, and K = 50 on the KuaiRec dataset. The p-value under the t-test between UIPS and the best baseline on each dataset is also reported to investigate the significance of the improvements.

From Table 5, we can first observe that on all three datasets the proposed UIPS achieves the highest precision, recall and NDCG. IPS-Cap cannot outperform CE due to the inaccuracy of the estimated logging probabilities. BanditNet, POEM and POXM tend to perform better with a larger action space, while MinVar, StableVar, Shrinkage and Adaptive are more suitable for scenarios with a small action set. UIPS still outperforms Shrinkage, highlighting the importance of modeling uncertainty in the estimated logging policy. However, reweighing based solely on uncertainties, ignoring the corresponding propensity scores, again leads to poor performance, as shown by UIPS-P and UIPS-O.

5. RELATED WORK

This work is the first of its kind to take the uncertainty of the estimated logging policy into consideration for improved policy learning. The following two lines of work are related to this paper.

Off-policy learning. In many real-world applications, such as search engines and recommender systems, interactive online model updates are expensive and risky (Jiang & Li (2016)). Off-policy learning has therefore attracted increasing interest, since it can leverage the already logged feedback data (Agarwal et al. (2019); Chen et al. (2019); Liu et al. (2022)). The main challenge in off-policy learning is how to address the mismatch between the logging policy and the learning policy. One line of work (Achiam et al. (2017); Schulman et al. (2015)) circumvents this by constraining the learning policy to stay close to the logging policy. However, such a constraint is too restrictive and thus not applicable in scenarios such as recommender systems, where user behaviors and items change rapidly. Another, more widely applied, approach is to leverage the Inverse Propensity Score (IPS) method to correct the discrepancy between the two policies, and various methods have been proposed for stabilized learning (Swaminathan & Joachims (2015c;a;b)) and variance control (Lopez et al. (2021); Liu et al. (2022)) on top of IPS. However, all of these works directly use the estimated logging policy for off-policy correction, leading to sub-optimal performance as shown in our experiments. Other work extends IPS-based off-policy learning to more complex problems, such as slate recommendation (Swaminathan et al. (2017)) and two-stage recommender systems (Ma et al. (2020)), but still overlooks the effect of the accuracy of the estimated logging policy. A recent work on causal recommendation (Ding et al. (2022)) also argues that propensity scores may be incorrect due to unobserved confounders. However, they assume that the effect of the unobserved confounder on any sample is bounded by a pre-defined hyper-parameter, and adversarially search for the worst-case propensity to update model parameters. Adapted to off-policy learning, this is a special case of our UIPS-O variant with the uncertainty set to a pre-defined constant. Off-policy learning can also be built directly on off-policy evaluation. In this line of research, several works (Su et al. (2020); Zhan et al. (2021)) propose to control the high variance caused by small logging probabilities via instance reweighing. However, they directly take the estimated logging policy as the true logging policy for correction, and are thus outperformed by UIPS in our experiments. A recent work (Saito & Joachims (2022)) assumes additional structure in the action space and proposes the marginalized IPS estimator. In contrast, our work considers the uncertainty in estimating the logging policy and thus adds no new assumptions about the problem space.

Uncertainty-aware learning. Estimation uncertainty has been extensively used for trading off exploration and exploitation in online learning (Xu et al. (2021); Zhou et al. (2020); Abbasi-Yadkori et al. (2011)). Recently, several works on offline reinforcement learning (Wu et al. (2021); An et al. (2021); Bai et al. (2022)) penalize the value function of out-of-distribution states and actions by directly subtracting the uncertainty, to tackle the extrapolation error. However, blindly penalizing samples of high uncertainty (i.e., UIPS-P) is problematic, as shown in our experiments. Proper correction depends on both the uncertainty in the logging-policy estimation and the actual value of the estimated logging probabilities.

6. CONCLUSION

In this paper, we propose a novel Uncertainty-aware Inverse Propensity Score estimator (UIPS) that explicitly models the uncertainty of the estimated logging policy for improved off-policy learning. UIPS weighs each logged instance to approach the ground-truth estimator, and a closed-form solution for the optimal weight is derived by minimizing an upper bound of the mean squared error (MSE). An improved policy is then obtained by optimizing the resulting estimator. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of UIPS. As demonstrated in this work, explicitly modeling the uncertainty of the estimated logging policy is crucial for effective off-policy learning; yet the best use of this uncertainty is not to simply down-weigh or drop instances with uncertain estimates, but to balance the uncertainty against the estimated logging probabilities themselves on a per-instance basis.

A APPENDIX

A.1 NOTATIONS AND ALGORITHM FRAMEWORK

For ease of reading, we list the important notations in Table 6 and summarize the main framework of the proposed UIPS in Algorithm 1.

Table 6: Notations

| Notation | Description |
|---|---|
| $\mathcal{X}$ | context space |
| $\mathcal{A}$ | action set |
| $x \in \mathbb{R}^d$ | context vector |
| $a$ | action |
| $r_{x,a}$ | reward |
| $\pi(a \mid x)$ | targeted policy to evaluate |
| $\beta^*(a \mid x)$ | the unknown ground-truth logging policy |
| $\hat{\beta}(a \mid x)$ | the estimated logging policy |
| $V(\pi)$ | value function |
| $\mathcal{D} := \{(x_n, a_n, r_{x_n,a_n}) \mid n \in [N]\}$ | logged dataset containing $N$ samples |
| $\phi^*_{x,a}$ | the optimal uncertainty-aware weight |
| $f_{\theta^*}(x,a)$ | the unknown ground-truth function that generates $\beta^*(a \mid x) = \exp(f_{\theta^*}(x,a)) / \sum_{a'} \exp(f_{\theta^*}(x,a'))$ |
| $f_{\hat{\theta}}(x,a)$ | the estimate of $f_{\theta^*}(x,a)$ that generates $\hat{\beta}(a \mid x)$ |
| $\mathcal{B}_{x,a}$ | confidence interval of $\hat{\beta}(a \mid x)$ |
| $U_{x,a}$ | uncertainty, defined via $|f_{\theta^*}(x,a) - f_{\hat{\theta}}(x,a)| \leq \gamma U_{x,a}$ |
| $g(x_n, a_n)$ | gradient of $f_{\hat{\theta}}(x_n, a_n)$ with respect to the last layer |

Note that calculating the logging probability for each sample, which is essential for both UIPS and IPS, takes $O(N d |\mathcal{A}|)$ time. Since the dimension $d$ is usually much smaller than the action size $|\mathcal{A}|$ and the sample size $N$, UIPS does not introduce significant computational overhead compared to the original IPS solution.

A.2 EXPERIMENTS DETAILS

A.2.1 SYNTHETIC DATA

Data generation. Given the ground-truth logging policy $\beta^*(a|x)$, we generate the logged dataset as follows. For each sample in the train set, we first obtain the embedded context vector $x$ from its original feature vector $\bar{x}$. We then sample an action $a$ according to $\beta^*(a|x)$ and obtain the reward $r_{x,a} = y_{x,a}$, resulting in a bandit feedback tuple $(x, a, r_{x,a})$, where $y_{x,a}$ is the label of class $a$ under the original feature vector $\bar{x}$. We repeat the above process N times to collect the logged dataset. In our experiments, we take d = 64 and N = 100.

Algorithm 1: UIPS
Input: the logged dataset $\mathcal{D} := \{(x_n, a_n, r_{x_n,a_n}) \mid n \in [N]\}$, the estimated logging policy model $\hat{\beta}(a|x) = \exp(f_{\hat{\theta}}(x,a)) / \sum_{a'} \exp(f_{\hat{\theta}}(x,a'))$, latent dimension $d$.
Init: $M_{\mathcal{D}} = I_{d \times d}$  // accumulate $M_{\mathcal{D}}$ for the uncertainty calculation
1: for n = 1, 2, ..., N do
2:     $M_{\mathcal{D}} = M_{\mathcal{D}} + \nabla_\theta f_{\hat{\theta}}(x_n, a_n)\, \nabla_\theta f_{\hat{\theta}}(x_n, a_n)^T$
3: $M_{\mathcal{D}}^{\mathrm{inv}} = \mathrm{inv}(M_{\mathcal{D}})$
4: for n = 1, 2, ..., N do
5:     $U_{x_n,a_n} = \nabla_\theta f_{\hat{\theta}}(x_n, a_n)^T M_{\mathcal{D}}^{\mathrm{inv}} \nabla_\theta f_{\hat{\theta}}(x_n, a_n)$
// Main part of UIPS
6: while not converged do
7:     for n = 1, 2, ..., N do
8:         compute $\phi^*_{x_n,a_n}$ as in Theorem 2
9:     compute gradients as in Equation (9) and update $\pi_\vartheta(a|x)$
Output: the learnt policy $\pi_\vartheta(a|x)$.

Implementation Details. We model the logging policy as in Equation (7) with $f_{\hat{\theta}}(x,a) = x^T \theta_a$, where $\{\theta_a\}$ are the parameters to learn. To train the logging policy, we take all samples in the logged dataset $\mathcal{D}$ as positive instances, and randomly sample non-selected actions as negative instances, as in Chen et al. (2019) (see the sketch at the end of this subsection). We use grid search to select the hyperparameters based on the model's performance on the validation set: the learning rate is searched in {1e-5, 1e-4, 1e-3, 1e-2}; λ, γ and η₁ are searched in {0.1, 0.5, 1, 2, 5, 10, 15, 20, 25, 30, 40, 50}; and η₂ is searched in {1, 10, 100, 1000}. For the baseline algorithms, we perform a similar grid search, with the search ranges following the original papers.

Ablation Study: Hyperparameter tuning. Although UIPS has four hyperparameters (λ, γ, η₁ and η₂), one only needs to carefully tune two of them, namely γ and $\eta_1^2/\lambda$, to obtain good performance of UIPS. This is because:
• η₂ acts as a capping threshold that ensures $\phi^*_{x,a} \leq 2\eta_2$ even for small propensity scores; hence, it should simply be set to a large value (e.g., 100).
• The key component (i.e., the first term) of $\phi^*_{x,a}$ can be rewritten as
$$\eta_1 \Big/ \Big[\exp(-\gamma U_{x,a}) + \frac{\eta_1^2}{\lambda} \cdot \frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2 \exp(-\gamma U_{x,a})}\Big].$$
Since all $(x,a)$ pairs are multiplied by $\phi^*_{x,a}$, the $\eta_1$ in the numerator does not affect the final performance much, and the key is to find a good value of $\eta_1^2/\lambda$ to balance the two terms in the denominator.
With η₁ and η₂ fixed, the effect of the hyperparameters γ and λ on precision and recall can be found in Figure 2a and Figure 2b respectively.

Figure 2: Effect of λ and γ on Precision@5 (a) and Recall@5 (b).
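The negative-sampling training of the logging policy described above can be sketched as follows. This is our own reading of the procedure (logged actions as positives, k randomly sampled non-selected actions as negatives); the function names and toy shapes are ours.

```python
import torch
import torch.nn.functional as F

def logging_policy_loss(f_pos, f_neg):
    """Sampled-softmax style loss for estimating f_theta_hat.

    f_pos: (batch,)   scores f(x_n, a_n) of the logged (positive) actions
    f_neg: (batch, k) scores of k randomly sampled non-selected (negative) actions
    """
    logits = torch.cat([f_pos.unsqueeze(1), f_neg], dim=1)   # positive score in column 0
    labels = torch.zeros(f_pos.shape[0], dtype=torch.long)   # the positive is the target class
    return F.cross_entropy(logits, labels)

# toy usage: linear scores f(x, a) = x^T theta_a with 5 sampled negatives per instance
batch, d, k = 32, 64, 5
x = torch.randn(batch, d)
theta_pos = torch.randn(batch, d, requires_grad=True)
theta_neg = torch.randn(batch, k, d, requires_grad=True)
loss = logging_policy_loss((x * theta_pos).sum(-1), (x.unsqueeze(1) * theta_neg).sum(-1))
loss.backward()
```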

A.2.2 REAL-WORLD DATA

Statistics of data. The statistics of the three real-world recommendation datasets with unbiased data can be found in Table 7. All these datasets contain a set of biased data collected from users' interactions on the platform, and a set of unbiased data collected from a randomized controlled trial where items are randomly selected. As in Ding et al. (2022), on each dataset the biased data is used for training and the unbiased data for testing, with a small part of the unbiased data split off for validation (5% on Yahoo and Coat, and 15% on KuaiRec). We label the reward as 1 if (1) the rating is larger than 3 in the Yahoo!R3 and Coat datasets, or (2) the user watched more than 70% of the video in KuaiRec; otherwise the reward is labeled as 0.

Implementation Details. We adopt a two-tower neural network architecture to implement both the logging and the learning policy, as shown in Figure 3. For the learning policy, the user representation and the item representation are first modelled through two separate neural networks (the user tower and the item tower), and their element-wise product vector is then projected to predict the user's preference for the item. We re-use the user state generated by the user tower of the learning policy, and model the logging policy with another separate item tower, following Chen et al. (2019). We also block gradients to prevent the logging policy from interfering with the user state of the learning policy. In each learning epoch, we first estimate the logging policy, and then take the estimated logging probabilities, together with their uncertainties, to optimize the learning policy. All hyperparameters are searched in a similar way to that described in Section 4.1.

A.3 COMPARISON WITH THE DOUBLY ROBUST ESTIMATOR

The doubly robust (DR) estimator (Jiang & Li (2016)), a hybrid of the direct method (DM) estimator and the inverse propensity score (IPS) estimator, is also widely used for off-policy evaluation. More specifically, let $\eta: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ be the imputation model in DM that estimates the reward of action $a$ under context vector $x$, and let $\hat{\beta}(a|x)$ be the estimated logging policy in the IPS estimator. The DR estimator evaluates the policy $\pi$ based on the logged dataset $\mathcal{D} := \{(x_n, a_n, r_{x_n,a_n}) \mid n \in [N]\}$ by

$$\hat{V}_{\mathrm{DR}}(\pi) = \hat{V}_{\mathrm{DM}}(\pi) + \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(a_n|x_n)}{\hat{\beta}(a_n|x_n)} \big(r_{x_n,a_n} - \eta(x_n, a_n)\big),$$

where $\hat{V}_{\mathrm{DM}}(\pi)$ is the DM estimator: $\hat{V}_{\mathrm{DM}}(\pi) = \frac{1}{N} \sum_{n=1}^{N} \sum_{a \in \mathcal{A}} \pi(a|x_n) \eta(x_n, a)$. Again assuming the policy $\pi(a|x)$ is parameterized by $\vartheta$, the REINFORCE gradient of $\hat{V}_{\mathrm{DR}}(\pi_\vartheta)$ with respect to $\vartheta$ can be readily derived as

$$\nabla_\vartheta \hat{V}_{\mathrm{DR}}(\pi_\vartheta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{a \in \mathcal{A}} \pi_\vartheta(a|x_n) \eta(x_n, a) \nabla_\vartheta \log \pi_\vartheta(a|x_n) + \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\vartheta(a_n|x_n)}{\hat{\beta}(a_n|x_n)} \big(r_{x_n,a_n} - \eta(x_n, a_n)\big) \nabla_\vartheta \log \pi_\vartheta(a_n|x_n). \quad (13)$$

The imputation model $\eta(x,a)$ is pre-trained following previous work (Liu et al. (2022)) with the same model architecture as the logging policy model. Besides the standard DR estimator, we also adapt UIPS and the best two baselines on the off-policy evaluation estimator (i.e., MinVar and Shrinkage) to the doubly robust setting using the same imputation model. Table 8 and Table 9 show the results on the synthetic datasets and the three real-world datasets respectively. For ease of comparison, we also include the experimental results of IPS-Cap and UIPS on each dataset in the two tables.
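For reference, the REINFORCE gradient of the DR estimator in Eq. (13) can be obtained from a surrogate loss in the same way as before. The sketch below is our own illustration, reusing the SoftmaxPolicy class from the Section 2 sketch, with eta_all holding imputed rewards for every action.

```python
import torch

def dr_surrogate_loss(policy, x, a, r, beta_hat, eta_all):
    """Surrogate loss whose autograd gradient matches Eq. (13).

    eta_all: (batch, |A|) imputed rewards eta(x_n, a) for all actions.
    """
    log_pi_all = torch.log_softmax(policy.linear(x), dim=-1)          # (batch, |A|)
    # DM term: sum_a pi(a|x_n) * eta(x_n, a) * grad log pi(a|x_n), with pi detached
    dm_term = (log_pi_all.exp().detach() * eta_all * log_pi_all).sum(dim=1)
    # IPS correction term on the logged action
    log_pi_a = log_pi_all.gather(1, a.unsqueeze(1)).squeeze(1)
    w = (log_pi_a.exp() / beta_hat).detach()
    eta_a = eta_all.gather(1, a.unsqueeze(1)).squeeze(1)
    correction = w * (r - eta_a) * log_pi_a
    return -(dm_term + correction).mean()
```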
Two p-values are also provided: (1) P-value(UIPSDR), the p-value under the t-test between UIPSDR and the best DR baseline on each dataset; and (2) P-value(UIPS), the p-value under the t-test between UIPS and the best DR baseline on each dataset. From Table 8 and Table 9, we can first observe that DR cannot consistently outperform IPS-Cap: it outperforms IPS-Cap on the Coat and KuaiRec datasets, while achieving much worse performance on the synthetic datasets and the Yahoo dataset. This is because the imputation model also plays an important role in the gradient calculation, as shown in Equation (13), so its accuracy greatly affects policy learning. When the imputation model is sufficiently accurate, for example on the Coat dataset with only 300 actions, incorporating the DM estimator not only leads to better performance of DR over IPS, but also to improved performance of UIPSDR over UIPS; in particular, in this situation UIPSDR performs better than DR, with the gain being statistically significant. When the imputation model is not accurate enough, for example on the KuaiRec dataset with a large action space but sparse reward feedback, DR is still worse than UIPS, and UIPSDR also performs worse than UIPS due to the distortion introduced by the imputation model.

Table 8: Experimental results on synthetic datasets. The best and second best results are highlighted with bold and underline respectively.

| Algorithm | P@5 (τ=0.5) | R@5 (τ=0.5) | NDCG@5 (τ=0.5) | P@5 (τ=1) | R@5 (τ=1) | NDCG@5 (τ=1) | P@5 (τ=2) | R@5 (τ=2) | NDCG@5 (τ=2) |
|---|---|---|---|---|---|---|---|---|---|
| IPS-Cap | 0.5515±2e-3 | 0.1553±8e-4 | 0.6031±2e-3 | 0.5526±2e-3 | 0.1561±6e-4 | 0.6016±1e-3 | 0.5409±3e-3 | 0.1529±9e-4 | 0.5901±2e-3 |
| UIPS | 0.5589±3e-3 | 0.1583±9e-4 | 0.6095±3e-3 | 0.5572±2e-3 | 0.1571±8e-4 | 0.6074±2e-3 | 0.5432±3e-3 | 0.1534±8e-4 | 0.5946±2e-3 |
| DR | 0.3846±3e-2 | 0.1082±8e-3 | 0.4684±3e-2 | 0.3631±3e-2 | 0.1017±9e-3 | 0.4494±3e-2 | 0.3560±3e-2 | 0.0995±7e-3 | 0.4470±2e-3 |
| MinVarDR | 0.3212±3e-2 | 0.0908±8e-3 | 0.4062±3e-2 | 0.3240±5e-2 | 0.0903±1e-2 | 0.3905±5e-2 | 0.3234±5e-2 | 0.0910±1e-2 | 0.4059±4e-2 |
| ShrinkageDR | 0.4139±2e-2 | 0.1161±7e-3 | 0.4969±3e-2 | 0.3944±3e-2 | 0.1101±8e-3 | 0.4797±2e-2 | 0.4080±3e-2 | 0.1135±7e-3 | 0.4901±2e-2 |
| UIPSDR | 0.4278±2e-2 | 0.1200±6e-3 | 0.5069±2e-2 | 0.4008±2e-2 | 0.1126±7e-3 | 0.4847±2e-2 | 0.4144±2e-2 | 0.1162±8e-3 | 0.4972±2e-2 |
| P-value(UIPSDR) | 2e-1 | 2e-1 | 3e-1 | 6e-1 | 4e-1 | 6e-1 | 6e-1 | 4e-1 | 5e-1 |
| P-value(UIPS) | 6e-13 | 4e-13 | 4e-12 | 2e-12 | 1e-12 | 5e-12 | 8e-12 | 8e-12 | 2e-11 |

Table 9: Experimental results on real-world unbiased datasets. The best and second best results are highlighted with bold and underline respectively. Two p-values are calculated: (1) P-value(UIPSDR): the p-value under the t-test between UIPSDR and the best DR baseline on each dataset; (2) P-value(UIPS): the p-value under the t-test between UIPS and the best DR baseline on each dataset.

| Algorithm | Yahoo P@5 | Yahoo R@5 | Yahoo NDCG@5 | Coat P@5 | Coat R@5 | Coat NDCG@5 | KuaiRec P@50 | KuaiRec R@50 | KuaiRec NDCG@50 |
|---|---|---|---|---|---|---|---|---|---|
| IPS-Cap | 0.2751±2e-3 | 0.7419±8e-3 | 0.5928±7e-3 | 0.2758±6e-3 | 0.4582±7e-3 | 0.4399±9e-3 | 0.8750±3e-3 | 0.0238±7e-5 | 0.8788±5e-3 |
| UIPS | 0.2868±2e-3 | 0.7742±5e-3 | 0.6274±5e-3 | 0.2877±3e-3 | 0.4757±5e-3 | 0.4576±8e-3 | 0.9120±1e-3 | 0.0250±5e-5 | 0.9174±7e-4 |
| DR | 0.2670±2e-3 | 0.7174±6e-3 | 0.5636±6e-3 | 0.2884±3e-3 | 0.4760±5e-3 | 0.4541±5e-3 | 0.8794±1e-2 | 0.0240±5e-4 | 0.8824±2e-2 |
| MinVarDR | 0.2272±5e-3 | 0.5989±1e-2 | 0.4525±1e-2 | 0.2704±4e-3 | 0.4434±9e-3 | 0.4137±6e-3 | 0.8640±7e-3 | 0.0235±2e-4 | 0.8657±7e-3 |
| ShrinkageDR | 0.2697±2e-3 | 0.7226±6e-3 | 0.5713±5e-3 | 0.2895±4e-3 | 0.4749±6e-3 | 0.4526±6e-3 | 0.8778±2e-2 | 0.0239±5e-4 | 0.8800±2e-2 |
| UIPSDR | 0.2721±1e-3 | 0.7294±6e-3 | 0.5750±5e-3 | 0.2946±4e-3 | 0.4854±8e-3 | 0.4647±8e-3 | 0.8849±1e-2 | 0.0242±4e-4 | 0.8896±1e-2 |
| P-value(UIPSDR) | 1e-2 | 2e-2 | 1e-1 | 7e-3 | 5e-3 | 2e-3 | 4e-1 | 4e-1 | 3e-1 |
| P-value(UIPS) | 1e-12 | 6e-14 | 6e-15 | 3e-1 | 8e-1 | 1e-1 | 2e-6 | 2e-6 | 1e-3 |

A.4 THEORETICAL PROOF

Proof of Proposition 1:

Proof. By the linearity of expectation, $\mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big] = \mathbb{E}_{\beta^*}\big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} r_{x,a}\big]$, thus

$$\mathrm{Bias}\big(\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big] - V(\pi_\vartheta) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} r_{x,a}\Big] = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} r_{x,a} \Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} - 1\Big)\Big] = \mathbb{E}_{\pi_\vartheta}\Big[r_{x,a}\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} - 1\Big)\Big].$$

For the variance, since the samples are drawn independently from the logging policy,

$$\mathrm{Var}_{\mathcal{D}}\big(\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \frac{1}{N} \mathrm{Var}_{\beta^*}\Big(\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big).$$

Rescaling by $N$, we get

$$N \cdot \mathrm{Var}_{\mathcal{D}}\big(\hat{V}_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} r_{x,a}^2\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big]^2 = \mathbb{E}_{\pi_\vartheta}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} \cdot \frac{\beta^*(a|x)^2}{\hat{\beta}(a|x)^2} r_{x,a}^2\Big] - \mathbb{E}_{\pi_\vartheta}\Big[\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big]^2 = \mathrm{Var}_{\pi_\vartheta}\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} r_{x,a}\Big) + \mathbb{E}_{\pi_\vartheta}\Big[\Big(\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} - 1\Big) \cdot \frac{\beta^*(a|x)^2}{\hat{\beta}(a|x)^2} r_{x,a}^2\Big].$$

This completes the proof.

Proof of Theorem 1:

Proof. We have

$$\mathrm{MSE}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\Big[\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big)^2\Big] = \Big(\mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big] - V(\pi_\vartheta)\Big)^2 + \mathrm{Var}_{\mathcal{D}}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathrm{Bias}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big)^2 + \mathrm{Var}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big).$$

We first bound the bias term:

$$\mathrm{Bias}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big] - V(\pi_\vartheta) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} r_{x,a}\Big] - V(\pi_\vartheta) \overset{(1)}{=} \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} r_{x,a} - \frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} r_{x,a}\Big] = \mathbb{E}_{\beta^*}\Big[r_{x,a} \frac{\pi_\vartheta(a|x)}{\beta^*(a|x)} \Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)\Big] \overset{(2)}{\leq} \sqrt{\mathbb{E}_{\pi_\vartheta}\Big[r_{x,a}^2 \frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\Big] \cdot \mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta^*(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)^2\Big]}.$$

Equality (1) follows from the linearity of expectation, and inequality (2) from the Cauchy-Schwarz inequality. We then bound the variance term:

$$\mathrm{Var}\big(\hat{V}_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \frac{1}{N} \mathrm{Var}_{\beta^*}\Big(\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} r_{x,a}\Big) = \frac{1}{N}\Big(\mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2 r_{x,a}^2\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \phi_{x,a} r_{x,a}\Big]^2\Big) \leq \frac{1}{N} \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2 r_{x,a}^2\Big] \leq \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2\Big],$$

where the last step uses $r_{x,a} \in [0,1]$ and $N \geq 1$. Combining the bounds on the bias and the variance completes the proof.

Proof of Theorem 2:

Proof. We first define several notations:
• $T(\phi_{x,a}, \beta_{x,a}) = \lambda \mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta_{x,a}}{\hat{\beta}(a|x)} \phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \phi_{x,a}^2\Big]$;
• $T(\phi_{x,a}) = \max_{\beta_{x,a} \in \mathcal{B}_{x,a}} T(\phi_{x,a}, \beta_{x,a})$, the maximum value of the inner problem;
• $T^* = \min_{\phi_{x,a}} T(\phi_{x,a}) = \min_{\phi_{x,a}} \max_{\beta_{x,a} \in \mathcal{B}_{x,a}} T(\phi_{x,a}, \beta_{x,a})$, the optimal min-max value, with $\phi^*_{x,a} = \arg\min_{\phi_{x,a}} T(\phi_{x,a})$;
• $\mathcal{B}^-_{x,a} := \frac{\hat{Z} \exp(-\gamma U_{x,a})}{Z^*} \hat{\beta}(a|x)$ and $\mathcal{B}^+_{x,a} := \frac{\hat{Z} \exp(\gamma U_{x,a})}{Z^*} \hat{\beta}(a|x)$, the two endpoints of $\mathcal{B}_{x,a}$.

We first find the maximum value of the inner problem, i.e., $T(\phi_{x,a})$ for any fixed $\phi_{x,a}$. As a function of $\beta_{x,a}$, $T(\phi_{x,a}, \beta_{x,a})$ is a convex quadratic minimized at $\beta_{x,a} = \hat{\beta}(a|x)/\phi_{x,a}$, so its maximum over $\mathcal{B}_{x,a}$ is attained at whichever endpoint is farther from $\hat{\beta}(a|x)/\phi_{x,a}$. There are three cases, as shown in Figure 4:
Case I: when $\phi_{x,a} \leq \hat{\beta}(a|x)/\mathcal{B}^+_{x,a}$, implying $\hat{\beta}(a|x)/\phi_{x,a} \geq \mathcal{B}^+_{x,a}$, $T(\phi_{x,a}) = T(\phi_{x,a}, \mathcal{B}^-_{x,a})$.
Case II: when $\hat{\beta}(a|x)/\mathcal{B}^+_{x,a} < \phi_{x,a} < \hat{\beta}(a|x)/\mathcal{B}^-_{x,a}$, $T(\phi_{x,a})$ is the maximum of $T(\phi_{x,a}, \mathcal{B}^-_{x,a})$ and $T(\phi_{x,a}, \mathcal{B}^+_{x,a})$. More specifically, when $\phi_{x,a} \geq \frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}$, $T(\phi_{x,a}) = T(\phi_{x,a}, \mathcal{B}^+_{x,a})$; otherwise, when $\phi_{x,a} < \frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}$, $T(\phi_{x,a}) = T(\phi_{x,a}, \mathcal{B}^-_{x,a})$.
Case III: when $\phi_{x,a} \geq \hat{\beta}(a|x)/\mathcal{B}^-_{x,a}$, implying $\hat{\beta}(a|x)/\phi_{x,a} \leq \mathcal{B}^-_{x,a}$, $T(\phi_{x,a}) = T(\phi_{x,a}, \mathcal{B}^+_{x,a})$.
Overall, we get:

$$T(\phi_{x,a}) = \begin{cases} T(\phi_{x,a}, \mathcal{B}^-_{x,a}), & \phi_{x,a} \in \big(-\infty, \tfrac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}\big), \\ T(\phi_{x,a}, \mathcal{B}^+_{x,a}), & \phi_{x,a} \in \big[\tfrac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}, \infty\big). \end{cases}$$

Next we find the minimum value of $T(\phi_{x,a})$. Without the constraint on $\phi_{x,a}$, $T(\phi_{x,a}, \mathcal{B}^+_{x,a})$ achieves its global minimum at $\phi^+_{x,a} = \lambda \frac{\mathcal{B}^+_{x,a}}{\hat{\beta}(a|x)} \Big/ \Big(\lambda \frac{(\mathcal{B}^+_{x,a})^2}{\hat{\beta}(a|x)^2} + \frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2}\Big)$. However, $\phi^+_{x,a} \leq \frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}$, so on $\big[\frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}, \infty\big)$ the minimum of $T(\phi_{x,a}, \mathcal{B}^+_{x,a})$ is attained at $\frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}$. On the other hand, without any constraint on $\phi_{x,a}$, the global minimum of $T(\phi_{x,a}, \mathcal{B}^-_{x,a})$ is attained at $\phi^-_{x,a} = \lambda \frac{\mathcal{B}^-_{x,a}}{\hat{\beta}(a|x)} \Big/ \Big(\lambda \frac{(\mathcal{B}^-_{x,a})^2}{\hat{\beta}(a|x)^2} + \frac{\pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2}\Big)$. Since the two pieces of $T(\phi_{x,a})$ agree at the boundary, the minimizer of $T(\phi_{x,a})$ is $\min\big\{\phi^-_{x,a}, \frac{2\hat{\beta}(a|x)}{\mathcal{B}^+_{x,a} + \mathcal{B}^-_{x,a}}\big\}$. Since $Z^*$ is unknown, we introduce $\eta_1$ and $\eta_2$ to represent $Z^*/\hat{Z}$ in the two terms respectively; we can take $\eta_1, \eta_2 \in [\exp(-\gamma U^{\max}_x), \exp(\gamma U^{\max}_x)]$ since $\hat{Z} \exp(-\gamma U^{\max}_x) \leq Z^* = \sum_{a'} \exp(f_{\theta^*}(x, a')) \leq \hat{Z} \exp(\gamma U^{\max}_x)$. Usually we set $\eta_1 \leq \eta_2$; we introduce two separate parameters because the scale of $\eta_1$ is closely related to the scale of $\lambda$, while the scale of $\eta_2$ is independent of it. Substituting $\mathcal{B}^\pm_{x,a}$ and $\eta_1, \eta_2$ into the two candidate values yields the closed form in Theorem 2, which completes the proof.

Lemma 1. Assume $\eta_1 \leq \eta_2$. Then, with $\pi_\vartheta(a|x)$ and $\hat{\beta}(a|x)$ fixed, and with $\alpha_{x,a}$ denoting the value of $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$ at which the two terms of $\phi^*_{x,a}$ in Theorem 2 are equal, we have the following observations:
• If $\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \leq \alpha_{x,a}$, then $\phi^*_{x,a} = \frac{2\eta_2}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})}$; otherwise $\phi^*_{x,a} = \lambda \Big/ \Big[\frac{\lambda}{\eta_1} \exp(-\gamma U_{x,a}) + \frac{\eta_1 \pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2 \exp(-\gamma U_{x,a})}\Big]$. In other words, $\phi^*_{x,a} \leq 2\eta_2$ always holds.
• If $\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \geq \frac{\sqrt{\lambda}}{\eta_1}$, then $\phi^*_{x,a}$ always decreases as $U_{x,a}$ increases.
• If $\alpha_{x,a} \leq \frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} < \frac{\sqrt{\lambda}}{\eta_1} \exp(-\gamma U_{x,a})$, a larger $U_{x,a}$ brings a larger $\phi^*_{x,a}$; otherwise $\phi^*_{x,a}$ still decreases as $U_{x,a}$ increases.

Proof. The first observation follows by solving $\lambda \Big/ \Big[\frac{\lambda}{\eta_1} \exp(-\gamma U_{x,a}) + \frac{\eta_1 \pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2 \exp(-\gamma U_{x,a})}\Big] \leq \frac{2\eta_2}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})}$ for $\pi_\vartheta(a|x)/\hat{\beta}(a|x)$.
For the second and third observations, write $u = U_{x,a}$ and consider the denominator of the first term, $L(u) = \frac{\lambda}{\eta_1} \exp(-\gamma u) + \frac{\eta_1 \pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \exp(\gamma u)$, so that $\phi^*_{x,a} = \lambda / L(u)$ whenever the first term is active. Then

$$\nabla_u L(u) = -\gamma \frac{\lambda}{\eta_1} \exp(-\gamma u) + \gamma \frac{\eta_1 \pi_\vartheta(a|x)^2}{\hat{\beta}(a|x)^2} \exp(\gamma u).$$

Letting $\nabla_u L(u) \geq 0$ gives $u \geq \frac{1}{\gamma} \log \frac{\sqrt{\lambda}\, \hat{\beta}(a|x)}{\eta_1 \pi_\vartheta(a|x)}$. This implies that when $U_{x,a} \geq \frac{1}{\gamma} \log \frac{\sqrt{\lambda}\, \hat{\beta}(a|x)}{\eta_1 \pi_\vartheta(a|x)}$, $\phi^*_{x,a}$ decreases as $U_{x,a}$ increases; otherwise $\phi^*_{x,a}$ increases with $U_{x,a}$. In particular, if $\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} \geq \frac{\sqrt{\lambda}}{\eta_1}$, the threshold $\frac{1}{\gamma} \log \frac{\sqrt{\lambda}\, \hat{\beta}(a|x)}{\eta_1 \pi_\vartheta(a|x)} \leq 0$, so, given $U_{x,a} \geq 0$, $\phi^*_{x,a}$ always decreases as $U_{x,a}$ increases. The condition $\frac{\pi_\vartheta(a|x)}{\hat{\beta}(a|x)} < \frac{\sqrt{\lambda}}{\eta_1} \exp(-\gamma U_{x,a})$ is equivalent to $U_{x,a} < \frac{1}{\gamma} \log \frac{\sqrt{\lambda}\, \hat{\beta}(a|x)}{\eta_1 \pi_\vartheta(a|x)}$, in which case a larger $U_{x,a}$ implies a larger $\phi^*_{x,a}$; otherwise $\phi^*_{x,a}$ still decreases as $U_{x,a}$ increases. This completes the proof.






Figure 1: Estimated logging policy and its uncertainty under different item frequency on KuaiRec.


Computation cost. The additional computation cost of UIPS over IPS comes from two parts:
• Pre-computing the uncertainties (lines 1-5 in Algorithm 1): this step computes the uncertainty of the logging probability for each $(x, a)$ pair and only needs to be executed once. Its computational cost is $O(N d^2 + d^3)$, where $O(N d^2)$ accounts for computing the uncertainty of each $(x, a)$ pair and $O(d^3)$ for the matrix inverse.
• Computing $\phi^*_{x,a}$ during training (line 8 in Algorithm 1): this only takes $O(1)$ time, the same computational cost as computing the IPS score.

Figure 3: Model architecture of the logging and the learning policy in real-world datasets


Figure 4: Three cases for maximizing inner problem.






Effect of λ and γ on NDCG@5.


Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492-11502. PMLR, 2020. Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1059-1068, 2018.

Table 7: The statistics of the three real-world datasets.




