SELF-SUPERVISED OFF-POLICY RANKING VIA CROWD LAYER

Abstract

Off-policy evaluation (OPE) aims to estimate the online performance of target policies given a dataset collected by some behavior policies. OPE is crucial in many applications where online policy evaluation is expensive. However, existing OPE methods are far from reliable. Fortunately, in many real-world scenarios, we care only about the ranking of the candidate policies, rather than their exact online performance. Existing works on off-policy ranking (OPR) adopt a supervised training paradigm, which assumes that plenty of deployed policies are available along with labels of their performance. However, this assumption does not hold in most OPE scenarios, because collecting such training data can be highly expensive. In this paper, we propose a novel OPR framework called SOCCER, in which existing OPE methods are modeled as workers in a crowdsourcing system. SOCCER can be trained in a self-supervised way, as it does not require any ground-truth performance labels for policies. Moreover, to capture the relative discrepancies between policies, we propose a novel transformer-based architecture that learns effective pairwise policy representations. Experimental results show that SOCCER achieves consistently high ranking accuracy across a variety of OPR tasks. Surprisingly, SOCCER even outperforms baselines trained in a supervised way with additional labeled data, which further demonstrates its superiority on OPR tasks.

1. INTRODUCTION

Off-policy evaluation (OPE) aims to estimate the online performance of given policies using only historical data collected by some other behavior policies. It is crucial for deploying reinforcement learning (RL) in real-world applications, such as trading, advertising, autonomous vehicles, and drug trials, where online policy evaluation might be highly expensive. OPE has also become increasingly important in causal inference and in model selection for offline RL (Oberst & Sontag, 2019; Nie et al., 2021). Most existing works on OPE focus on estimating the online performance of target policies and can be categorized into three classes: Inverse Propensity Scoring (IPS) based methods, Direct Methods (DM), and Hybrid Methods (HM). Unfortunately, existing OPE methods are far from reliable in real applications. Standard IPS-based estimators such as importance sampling suffer from high variance due to the product of per-step importance weights (Hanna et al., 2019). DM requires extra estimators of the environment dynamics or value functions, which are hard to learn when the observations are high-dimensional or the data is insufficient. HM, such as doubly robust estimators, combines IPS and DM (Jiang & Li, 2016), yet often comes with additional hyperparameters that must be carefully tuned.

Fortunately, in many real-world scenarios, we do not need to estimate the exact online performance of target policies. Instead, we only care about which policy would perform best when deployed online. This inspires us to develop a policy ranker that focuses on predicting the ranking of target policies with respect to their online performance.

A recent work proposes a policy ranking model called SOPR-T (Jin et al., 2022), which is trained in a supervised paradigm under the assumption that there are plenty of extra deployed policies whose online performance can serve as supervision signals. However, this assumption is impractical in many real-world OPE tasks, since collecting the online performance of policies can be extremely expensive. In addition, SOPR-T directly maps the data of state-action pairs to a score, yielding an inefficient policy representation scheme that fails to capture the relative discrepancies between policies.
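To make the variance issue concrete, the following is a minimal sketch of the ordinary importance sampling estimator, not the method of any particular paper. It assumes logged trajectories of (state, action, reward) tuples and callables that return each policy's probability of an action in a state; all names are illustrative. The key point is that the importance weight is a product of per-step likelihood ratios, so it can explode or vanish as the horizon grows.

```python
import numpy as np

def per_trajectory_is_estimate(trajectory, pi_target, pi_behavior, gamma=0.99):
    """Ordinary importance sampling (IS) estimate of the target policy's
    discounted return from a single logged trajectory.

    trajectory: iterable of (state, action, reward) tuples.
    pi_target, pi_behavior: callables mapping (state, action) to the
        probability of taking `action` in `state` under each policy.
    """
    weight, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        # The IS weight is a *product* of per-step likelihood ratios:
        # over long horizons it easily explodes or collapses to zero,
        # which is the root cause of the estimator's high variance.
        weight *= pi_target(s, a) / pi_behavior(s, a)
        ret += (gamma ** t) * r
    return weight * ret

def is_value_estimate(trajectories, pi_target, pi_behavior, gamma=0.99):
    # Average the per-trajectory estimates over the logged dataset.
    return np.mean([per_trajectory_is_estimate(tau, pi_target, pi_behavior, gamma)
                    for tau in trajectories])
```

For off-policy ranking, by contrast, such estimates need only preserve the ordering of the policies' true returns, e.g., as measured by Spearman rank correlation between estimated scores and true returns, which is a strictly weaker requirement than accurate value estimation.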

