UNCERTAINTY-AWARE OFF POLICY LEARNING

Abstract

Off-policy learning, the procedure of optimizing a policy with access only to logged feedback data, has proven important in various real-world applications, such as search engines, recommender systems, etc. Because the ground-truth logging policy that generated the logged data is usually unknown, previous work directly substitutes an estimate of it in off-policy learning, resulting in a biased estimator. This estimator suffers from both high bias and high variance on samples with small and inaccurate estimated logging probabilities. In this work, we explicitly model the uncertainty in the estimated logging policy and propose a novel Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.

1. INTRODUCTION

In many real-world applications, including search engines (Agarwal et al. (2019)), online advertisements (Strehl et al. (2010)), and recommender systems (Chen et al. (2019); Liu et al. (2022)), only logged feedback data is available for subsequent policy optimization. For example, in recommender systems, various complex recommendation models (i.e., policies) (Zhou et al. (2018); Guo et al. (2017)) are optimized with logged user interactions (e.g., clicks or staytime) on items recommended by previous recommendation policies. However, such logged data is known to be biased, since one does not observe the feedback on items that the previous policy (generally referred to as the logging policy) did not take. This inevitably distorts the evaluation and optimization of a new policy when it tends to select items that are not in the logged data. Off-policy learning (Thrun & Littman (2000); Precup (2000)) emerges as a favorable way to learn an improved policy only from the logged data by addressing the mismatch between the learning policy and the logging policy. One of the most commonly used off-policy learning methods is Inverse Propensity Scoring (IPS) (Chen et al. (2019); Munos et al. (2016)), which assigns a per-sample importance weight to the training objective on the logged data, so as to obtain an optimization objective that is unbiased in expectation. The importance weight in IPS is the probability ratio between the learning policy and the logging policy. However, the ground-truth logging policy is unavailable to the learner, e.g., it is not recorded in the data. One common treatment taken by previous work (Strehl et al. (2010); Liu et al. (2022); Chen et al. (2019); Ma et al. (2020)) is to first employ a supervised learning method (e.g., logistic regression, neural networks, etc.) to estimate the logging policy, and then use the estimated logging policy for off-policy learning. We theoretically show that such an approximation results in a biased estimator which is sensitive to inaccurate and small estimated logging probabilities.
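As a concrete illustration, the IPS objective with an estimated logging policy can be sketched as below. All names are illustrative; the optional clipping threshold is a common variance-control heuristic, not part of plain IPS.

```python
import numpy as np

def ips_value(pi_probs, beta_hat_probs, rewards, clip=None):
    """IPS estimate of V(pi) from logged data:
        V_IPS = (1/N) * sum_n  pi(a_n|x_n) / beta_hat(a_n|x_n) * r_n.
    When beta_hat underestimates the true propensity on rarely logged
    actions, the ratio explodes, causing the high bias and variance
    discussed in the text.
    """
    w = pi_probs / beta_hat_probs          # importance weights
    if clip is not None:                   # optional clipping heuristic
        w = np.minimum(w, clip)
    return np.mean(w * rewards)
```

When the learning policy equals the (correctly estimated) logging policy, every weight is 1 and the estimate reduces to the empirical mean reward of the logged data.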
Worse still, small estimated logging probabilities usually mean that there are few related samples in the logged data, so the estimates carry high uncertainty, i.e., they are inaccurate with high probability. Figure 1 shows empirical evidence from the large-scale recommendation benchmark KuaiRec (Gao et al. (2022)): items with lower frequencies in the logged dataset have both lower estimated logging probabilities and higher uncertainties. The high bias and variance caused by these samples greatly hinder the performance of off-policy learning. In this work, we explicitly take the uncertainty of the estimated logging policy into consideration and design a novel Uncertainty-aware Inverse Propensity Score estimator (UIPS) as the optimization objective for policy learning. UIPS introduces an additional weight to approach the ground-truth propensity from the estimated one, and learns an improved policy by alternating between two steps: (1) find the optimal weight that makes the estimator as accurate as possible, taking into account the uncertainty of the estimated logging policy; (2) improve the policy by optimizing the resulting estimator. We further derive a closed-form solution for the optimal weight from an upper bound on the mean squared error (MSE) with respect to the ground-truth policy value. The optimal weight adjusts sample weights by jointly considering the uncertainty of the estimated logging probabilities and the propensity scores, rather than simply boosting or penalizing samples with highly uncertain logging probabilities. Experiment results on synthetic and three real-world recommendation datasets demonstrate the sample efficiency of UIPS. All data and code can be found in the supplementary materials for reproducibility.
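The alternation UIPS performs can be sketched as follows. The closed-form optimal weight is derived later in the paper; the shrinkage used here (`phi`), which damps the raw importance ratio on samples whose estimated propensity is uncertain, is purely an illustrative stand-in, and `sigma` denotes a hypothetical per-sample uncertainty estimate.

```python
import numpy as np

def uncertainty_aware_weights(pi_probs, beta_hat_probs, sigma):
    """Correction weights phi_n applied to the raw ratio pi/beta_hat.
    Samples with more uncertain propensity estimates (larger sigma)
    are shrunk more. NOTE: this simple shrinkage is NOT the paper's
    closed-form MSE-optimal solution; it only illustrates the idea.
    """
    raw = pi_probs / beta_hat_probs
    phi = 1.0 / (1.0 + sigma * raw)    # illustrative shrinkage
    return phi * raw

def uips_objective(pi_probs, beta_hat_probs, sigma, rewards):
    """Step (1): form the uncertainty-adjusted estimator; step (2),
    not shown, would take a gradient step on pi_probs."""
    w = uncertainty_aware_weights(pi_probs, beta_hat_probs, sigma)
    return np.mean(w * rewards)
```

With zero uncertainty the objective reduces to plain IPS, which matches the intuition that the correction only matters where the logging-policy estimate is unreliable.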
To summarize, our contributions in this work are as follows:
• We point out that directly using the estimated logging policy leads to sub-optimal off-policy learning, since the resulting biased estimator is greatly distorted by samples with inaccurate and small estimated logging probabilities.
• We take the uncertainty of the estimated logging policy into consideration and propose UIPS for more accurate off-policy learning.
• Experiments on synthetic and three real-world recommendation datasets demonstrate UIPS's strong advantage over state-of-the-art methods.

2. PRELIMINARY: OFF-POLICY LEARNING

We focus on the standard contextual bandit setup to explain the key concepts. Following convention (Joachims et al. (2018); Saito & Joachims (2022); Su et al. (2020)), let x ∈ X ⊆ R^d be a d-dimensional context vector drawn from an unknown probability distribution p(x). Each context is associated with a finite set of actions denoted by A, where |A| < ∞. Let π : A × X → [0, 1] denote a stochastic policy, such that π(a|x) is the probability of selecting action a under context x and Σ_{a∈A} π(a|x) = 1. Under a given context, reward r_{x,a} is observed when action a is chosen. Take news recommendation as an example: x represents the state of a user, summarizing his/her interaction history with the recommender system; each action a is a candidate news article; the policy is a recommendation algorithm; and the reward r_{x,a} denotes the user feedback on article a, e.g., whether the user clicks the article. Let V(π) denote the expected reward, or value, of the policy π: V(π) = E_{x∼p(x), a∼π(a|x)}[r_{x,a}]. We look for a policy π(a|x) that maximizes V(π). In the rest of the paper, we denote E_{x∼p(x), a∼π(a|x)}[·] as E_π[·] for simplicity. In contrast to performing online updates by following the learning policy π(a|x), in off-policy learning we can only access a set of logged feedback data denoted by D := {(x_n, a_n, r_{x_n,a_n}) | n ∈ [N]}, where [N] := {1, . . . , N}. Given x_n, the action a_n was generated by a stochastic logging policy β*, i.e., the probability that action a_n was selected is β*(a_n|x_n). The actions {a_1, . . . , a_N} and their corresponding rewards {r_{x_1,a_1}, . . . , r_{x_N,a_N}} are generated independently given β*. Due to the nature of policy optimization, the learning policy π(a|x) is expected to differ from β*(a|x), unless β*(a|x) is already optimal. Moreover, in practice the situation can be further complicated. Again, consider the news recommendation scenario.
Due to the scalability requirement, industrial recommender systems usually adopt a two-stage framework (Ma et al. (2020)), where one or several
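The logged-data generation process described in this section (contexts x_n ∼ p(x), actions a_n ∼ β*(·|x_n), observed rewards r_{x_n,a_n}) can be sketched as below. The softmax logging policy, the linear reward model, and all sizes are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_actions, N = 4, 5, 1000   # illustrative sizes, not from the paper

def softmax_policy(theta, x):
    """A stochastic policy pi(a|x) = softmax(x @ theta); rows sum to 1."""
    s = x @ theta
    s = s - s.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(s)
    return p / p.sum(axis=-1, keepdims=True)

# Hypothetical logging policy beta* and reward model.
theta_log = rng.normal(size=(d, num_actions))
W_reward = rng.normal(size=(d, num_actions))

# Generate the logged dataset D = {(x_n, a_n, r_{x_n,a_n}) | n in [N]}.
X = rng.normal(size=(N, d))                       # x_n ~ p(x)
P = softmax_policy(theta_log, X)                  # beta*(.|x_n)
A = np.array([rng.choice(num_actions, p=row) for row in P])
R = (X @ W_reward)[np.arange(N), A]               # r_{x_n, a_n}

# Monte Carlo estimate of the logging policy's own value V(beta*).
V_log = R.mean()
```

Only the triples (x_n, a_n, r_{x_n,a_n}) reach the learner; feedback on the actions β* did not take is never observed, which is exactly the source of bias discussed in Section 1.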




Figure 1: Estimated logging policy and its uncertainty under different item frequencies on KuaiRec.

