SELF-SUPERVISED OFF-POLICY RANKING VIA CROWD LAYER

Abstract

Off-policy evaluation (OPE) aims to estimate the online performance of target policies given datasets collected by behavior policies. OPE is crucial in many applications where online policy evaluation is expensive. However, existing OPE methods are far from reliable. Fortunately, in many real-world scenarios, we care only about the ranking of the policies being evaluated, rather than their exact online performance. Existing work on off-policy ranking (OPR) adopts a supervised training paradigm, which assumes that plenty of deployed policies are available along with labels of their performance. However, this assumption does not hold in most OPE scenarios, because collecting such training data can be highly expensive. In this paper, we propose a novel OPR framework called SOCCER, in which existing OPE methods are modeled as workers in a crowdsourcing system. SOCCER can be trained in a self-supervised way, as it does not require any ground-truth performance labels of policies. Moreover, to capture the relative discrepancies between policies, we propose a novel transformer-based architecture that learns effective pairwise policy representations. Experimental results show that SOCCER achieves significantly higher accuracy than existing methods on a variety of OPR tasks. Surprisingly, SOCCER even outperforms baselines trained in a supervised way on additional labeled data, which further demonstrates its superiority in OPR tasks.

1. INTRODUCTION

Off-policy evaluation (OPE) aims to estimate the online performance of given policies using only historical data collected by other behavior policies. It is crucial for deploying reinforcement learning (RL) in real-world applications, such as trading, advertising, autonomous vehicles, and drug trials, where online policy evaluation can be highly expensive. OPE is also becoming increasingly important in causal inference and model selection for offline RL (Oberst & Sontag, 2019; Nie et al., 2021). Most existing works on OPE focus on estimating the online performance of target policies and can be categorized into three classes: Inverse Propensity Scoring (IPS) based methods, Direct Methods (DM), and Hybrid Methods (HM). Unfortunately, existing OPE methods are far from reliable in real applications. Standard IPS-based estimators such as importance sampling suffer from high variance due to the product of importance weights (Hanna et al., 2019). DM requires extra estimators of environmental dynamics or value functions, which are hard to learn when the observation data is high-dimensional or insufficient. HM such as doubly robust estimators combines IPS and DM (Jiang & Li, 2016), yet it often comes with additional hyperparameters that need to be carefully chosen. Fortunately, in many real-world scenarios, we do not need to estimate the exact online performance of target policies. Instead, we only care about which policy would perform the best when deployed online. This inspires us to develop a policy ranker that focuses on predicting the ranking of target policies with respect to their online performance. A recent work proposes a policy ranking model called SOPR-T (Jin et al., 2022), which is trained in a supervised paradigm under the assumption that there are plenty of extra deployed policies whose performance can be used as supervision signals.
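To illustrate why standard importance sampling suffers from high variance, the following is a minimal sketch of the trajectory-wise IS estimator. The function name and the `pi_target`/`pi_behavior` interfaces are our own illustrative choices, not from the paper; the key point is the per-step product of likelihood ratios, which can blow up with the horizon.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Vanilla trajectory-wise importance sampling OPE estimate (sketch).

    Each trajectory is a list of (state, action, reward) tuples; pi_target
    and pi_behavior map (state, action) -> action probability.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Product of per-step likelihood ratios: the source of the
            # variance that grows exponentially with the horizon.
            weight *= pi_target(s, a) / pi_behavior(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

Each trajectory contributes its discounted return reweighted by the full product of ratios, so even moderate per-step mismatches between the two policies compound quickly.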
However, this assumption is impractical in many real-world OPE tasks, since collecting the online performance of policies can be extremely expensive. In addition, SOPR-T directly maps the data of state-action pairs to a score, yielding an inefficient policy representation scheme that fails to capture the relative discrepancies between policies. In this paper, we propose a novel Self-supervised Off-poliCy ranking model based on Crowd layER (SOCCER) to address the above challenges. The novelty of SOCCER is two-fold. First, we employ a crowdsourcing paradigm to solve the OPR problem, where the workers come from a diverse pool of existing OPE methods, each providing labels of whether one policy would perform better than another. Note that these labels are constructed by comparing the estimated accumulated rewards of the target policies, so our model can be trained in a self-supervised way. Second, we propose a novel Policy Comparison Transformer (PCT) architecture to learn efficient policy representations. Instead of directly mapping the state-action pairs to a policy embedding (as is done in SOPR-T), PCT learns a pairwise representation of two policies that captures their differences on the same set of states. With the help of PCT, our policy ranking model generalizes well in the policy space. Experimental results show that SOCCER not only achieves significantly higher ranking performance than existing OPE methods, but also outperforms baselines trained using additional ground-truth labels.
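The self-supervised labeling step described above can be sketched as follows, assuming each OPE "worker" has already produced a scalar value estimate per target policy; all names and shapes here are illustrative, not the paper's exact formulation.

```python
import itertools
import numpy as np

def pairwise_crowd_labels(value_estimates):
    """Build self-supervised pairwise labels from OPE 'workers' (sketch).

    value_estimates: array of shape (num_workers, num_policies), where entry
    (k, i) is worker k's estimated return for policy i. Returns a dict mapping
    each policy pair (i, j) with i < j to a list of binary labels, one per
    worker: 1 if the worker ranks policy i above policy j, else 0.
    """
    value_estimates = np.asarray(value_estimates)
    num_workers, num_policies = value_estimates.shape
    labels = {}
    for i, j in itertools.combinations(range(num_policies), 2):
        labels[(i, j)] = [int(value_estimates[k, i] > value_estimates[k, j])
                          for k in range(num_workers)]
    return labels
```

No ground-truth performance appears anywhere: every label comes from comparing two estimates made by the same (possibly unreliable) OPE method, which is exactly what makes the training self-supervised.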

2. RELATED WORKS

Off-policy evaluation/ranking. The goal of OPE is to precisely predict the online performance of target policies given trajectory data collected by other behavior policies. The standard importance sampling approach suffers from exponential variance with respect to the time horizon (Li et al., 2015; Jiang & Li, 2016). Recent works such as Fitted-Q evaluation (Hoang et al., 2019) and marginalized importance sampling (Liu et al., 2018) achieve polynomial variance, yet they rely on additional function approximators. Direct methods avoid the large variance by learning the dynamics model or Q-function, which can be biased, especially when the data is insufficient. Some works study the hyperparameter-free policy selection problem, yet their method only applies to Q-learning based policies (Zhang & Jiang, 2021). A recent work directly studies the OPR problem: it collects the online performance of a large set of policies and uses these labeled data to train a policy ranker (Jin et al., 2022). However, collecting such data can be extremely expensive in many applications.

Learning from crowds. Crowdsourcing systems enable machine learners to collect labels for large datasets from crowds. A major issue with crowdsourcing systems is that the labels provided by crowds are often noisy (S. & Zhang, 2019). To tackle this challenge, various probabilistic generative methods have been proposed for statistical inference (Yuchen et al., 2016; Tian & Zhu, 2015). Another line of works uses discriminative models that find the most likely label for each instance (Jing et al., 2014; 2015). A recent work called Crowd Layer (CL) first describes an algorithm for jointly learning the target model and the reliability of workers (Filipe & Pereira, 2018). CL proposes a simple yet efficient crowd layer that can train deep neural networks end-to-end directly from the noisy labels.
In our work, we treat existing OPE methods as workers and adopt CL to process multiple noisy labels, because CL is naturally compatible with our model.
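A minimal sketch of a crowd-layer style loss for binary "policy A beats policy B" labels, in the spirit of CL: each worker gets its own small weight matrix applied to the shared model's logits, so worker-specific biases are absorbed there while the shared model learns the clean signal. The shapes and names below are our illustrative assumptions, not SOCCER's exact formulation.

```python
import numpy as np

def crowd_layer_loss(logits, worker_mats, worker_labels):
    """Forward pass of a minimal crowd-layer loss (sketch).

    logits: (batch, 2) outputs of the shared ranking model.
    worker_mats: (num_workers, 2, 2) per-worker matrices, learned jointly
    with the shared model to absorb each worker's labeling bias.
    worker_labels: (num_workers, batch) noisy binary labels, one row per
    OPE worker. Returns the mean cross-entropy across workers.
    """
    total = 0.0
    for W, y in zip(worker_mats, worker_labels):
        z = logits @ W.T                           # worker-specific transform
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax per example
        total += -np.log(p[np.arange(len(y)), y]).mean()
    return total / len(worker_mats)
```

In training, gradients flow through each worker matrix back into the shared model, which is what lets the network be trained end-to-end directly from the noisy labels.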

Policy representation. Compact but informative representations of policies not only benefit the policy learning process (Tang et al., 2022), but also help with policy transfer across different tasks (Isac et al., 2019; G. et al., 2017). A straightforward idea is to represent a policy by its network parameters, yet this leads to a very sparse representation space. Network Fingerprint (Harb et al., 2020) proposes a differentiable representation that concatenates the action vectors outputted by the policy network on a set of probing states. Some recent works try to encode policy parameters as well as state-action pair data into a low-dimensional embedding space (Tang et al., 2022; Jin et al., 2022). However, existing works focus on representations of single policies, which fail to capture the relative discrepancies between policies.
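As a concrete illustration of the fingerprint idea and of why single-policy representations miss relative discrepancies, here is a minimal sketch; the `policy` interface (a callable from state to action vector) and both function names are hypothetical.

```python
import numpy as np

def network_fingerprint(policy, probing_states):
    """Fingerprint-style representation: concatenate the policy's action
    outputs on a fixed set of probing states (sketch after the Network
    Fingerprint idea)."""
    return np.concatenate([np.asarray(policy(s)).ravel()
                           for s in probing_states])

def pairwise_discrepancy(pi_a, pi_b, probing_states):
    """Difference of two fingerprints on the SAME probing states: a crude
    pairwise representation capturing where the policies disagree."""
    return (network_fingerprint(pi_a, probing_states)
            - network_fingerprint(pi_b, probing_states))
```

Two policies with similar single-policy fingerprints can still disagree on exactly the states that matter for ranking; representing the pair jointly, as PCT does with a transformer rather than this simple difference, keeps that information.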

