COUNTERFACTUAL SELF-TRAINING

Abstract

Unlike traditional supervised learning, many settings provide only partial feedback: we observe outcomes for the chosen actions, but not the counterfactual outcomes associated with other alternatives. Such settings encompass a wide variety of applications, including pricing, online marketing, and precision medicine. A key challenge is that observational data are influenced by the historical policies deployed in the system, yielding a biased data distribution. We approach this task as a domain adaptation problem and propose a self-training algorithm that imputes discrete outcomes for the finitely many unseen actions in the observational data to simulate a randomized trial. We offer a theoretical motivation for this approach by providing an upper bound on the generalization error, defined on a randomized trial, under the self-training objective. We empirically demonstrate the effectiveness of the proposed algorithms on both synthetic and real datasets.

1. INTRODUCTION

Counterfactual inference (Pearl et al., 2000) attempts to address a question central to many applications: what would the outcome have been had an alternative action been chosen? It may be selecting relevant ads to engage users in online marketing (Li et al., 2010), determining prices that maximize profit in revenue management (Bertsimas & Kallus, 2016), or designing the most effective personalized treatment for a patient in precision medicine (Xu et al., 2016). With observational data, we have access to past actions, their outcomes, and possibly some context, but in many cases not complete knowledge of the historical policy which gave rise to the actions (Shalit et al., 2017). Consider a pricing setting in the form of a targeted promotion. We might record a customer's information (context), the promotion offered (action), and whether an item was purchased (outcome), but we do not know why a particular promotion was selected. Unlike traditional supervised learning, we only observe feedback for the chosen action in observational data, but not the outcomes associated with other alternatives (i.e., in the pricing example, we do not observe what would have occurred if a different promotion had been offered). In contrast to the gold standard of a randomized controlled trial, observational data are influenced by the historical policy deployed in the system, which may over- or under-represent certain actions, yielding a biased data distribution. A naive but widely used approach is to train a machine learning model directly on observational data and use it for prediction. This is often referred to as the direct method (DM) (Dudík et al., 2014). Failure to account for the bias introduced by the historical policy often results in a model which has high accuracy on the data it was trained on, but performs considerably worse under a different policy.
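To make the direct method concrete, the following sketch fits an outcome model on logged bandit feedback and then scores every candidate action for each context. The data-generating process, feature encoding, and choice of logistic regression are all illustrative assumptions, not the construction used in this paper; the point is only that the model is trained exclusively on actions the biased logging policy happened to choose.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic logged bandit data: contexts X, actions A drawn from a biased
# historical policy, and binary outcomes Y observed only for the chosen action.
n, n_actions = 2000, 3
X = rng.normal(size=(n, 2))
# Biased logging policy: action probabilities depend on the first feature
# (Gumbel-max trick samples from the softmax of these logits).
logits = np.stack([X[:, 0], -X[:, 0], np.zeros(n)], axis=1)
A = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)
# True outcome mechanism (unknown to the learner).
Y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 1] + A - 1)))).astype(int)

# Direct method: regress the outcome on (context, one-hot action) pairs
# drawn from the logged, policy-biased distribution.
feats = np.column_stack([X, np.eye(n_actions)[A]])
dm = LogisticRegression().fit(feats, Y)

def predict_all_actions(model, X):
    """Counterfactual scores: predicted outcome for every action."""
    cols = []
    for a in range(n_actions):
        onehot = np.tile(np.eye(n_actions)[a], (len(X), 1))
        cols.append(model.predict_proba(np.column_stack([X, onehot]))[:, 1])
    return np.stack(cols, axis=1)  # shape (len(X), n_actions)

scores = predict_all_actions(dm, X[:10])
```

Because the regressor never sees (context, action) pairs that the logging policy avoided, its counterfactual scores in those regions are extrapolations, which is exactly the failure mode described above.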
For example, in the pricing setting, if most customers who historically received high promotional offers fit a certain profile, then a direct-method model may fail to produce reliable predictions for these customers when low offers are given. To overcome the limitations of the direct method, Shalit et al. (2017), Johansson et al. (2016), and Lopez et al. (2020) cast counterfactual learning as a domain adaptation problem, where the source domain is the observational data and the target domain is a randomized trial in which actions are assigned uniformly at random for a given context. The key idea is to map contextual features to an embedding space and jointly learn a representation that encourages similarity between the two domains, leading to better counterfactual inference. The embedding is generally learned by a neural network, and the estimate of the domain gap is usually slow to compute.
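The self-training idea summarized in the abstract can be sketched in a few lines: impute discrete pseudo-outcomes for the actions that were not taken, so that the augmented dataset covers every action for every context as a randomized trial would, then refit the model on the augmented data and iterate. This is a schematic reading of the abstract's description under assumed synthetic data, a one-hot feature encoding, and a logistic-regression outcome model; it is not the paper's exact objective or architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical logged data: contexts X, actions A from a biased policy,
# and binary outcomes Y observed only for the chosen action.
n, n_actions = 2000, 3
X = rng.normal(size=(n, 2))
A = rng.choice(n_actions, size=n, p=[0.7, 0.2, 0.1])  # biased logging policy
Y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + A - 1)))).astype(int)

def featurize(X, A, n_actions):
    """Concatenate context features with a one-hot action encoding."""
    return np.column_stack([X, np.eye(n_actions)[A]])

# Initialize with the direct method on the logged data alone.
model = LogisticRegression().fit(featurize(X, A, n_actions), Y)

# Self-training loop: impute discrete pseudo-outcomes for every unseen
# (context, action) pair to simulate a randomized trial, then refit.
for _ in range(5):
    Xs, As, Ys = [X], [A], [Y]
    for a in range(n_actions):
        unseen = A != a                      # contexts that never saw action a
        Aa = np.full(int(unseen.sum()), a)
        pseudo = model.predict(featurize(X[unseen], Aa, n_actions))
        Xs.append(X[unseen]); As.append(Aa); Ys.append(pseudo)
    model = LogisticRegression().fit(
        featurize(np.concatenate(Xs), np.concatenate(As), n_actions),
        np.concatenate(Ys))
```

After augmentation, every context appears with every action exactly once, so the training distribution over actions matches the uniform assignment of the target randomized trial, without requiring a neural embedding or an explicit domain-gap estimate.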

