Learning Deep Features in Instrumental Variable Regression

Abstract

Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.

1. Introduction

The aim of supervised learning is to obtain a model based on samples observed from some data generating process, and then to make predictions about new samples generated from the same distribution. If our goal is to predict the effect of our actions on the world, however, our aim becomes to assess the influence of interventions on this data generating process. To answer such causal questions, a supervised learning approach is inappropriate, since our interventions, called treatments, may affect the underlying distribution of the variable of interest, which is called the outcome. To answer these counterfactual questions, we need to learn how treatment variables causally determine the distribution of outcomes, which is expressed by a structural function. Learning a structural function from observational data (that is, data where we can observe, but not intervene) is known to be challenging if there exists an unmeasured confounder, which influences both treatment and outcome.

To illustrate: suppose we are interested in predicting sales of airplane tickets given price. During the holiday season, we would observe a simultaneous increase in sales and prices. This does not mean that raising the price causes sales to increase. In this context, the time of year is a confounder, since it affects both sales and prices, and we need to correct the bias it causes. One way of correcting such bias is via instrumental variable (IV) regression (Stock and Trebbi, 2003). Here, the structural function is learned using instrumental variables, which directly affect the treatment but influence the outcome only through the treatment. In the sales prediction scenario, we can use supply cost shifters as the instrumental variable, since they affect only the price (Wright, 1928; Blundell et al., 2012). Instrumental variables can be found in many contexts, and IV regression is extensively used by economists and epidemiologists.
For example, IV regression has been used to measure the effect of a drug under imperfect compliance (Angrist et al., 1996), or the influence of military service on lifetime earnings (Angrist, 1990). In this work, we propose a novel IV regression method that can discover nonlinear causal relationships using deep neural networks.

Classically, IV regression is solved by the two-stage least squares (2SLS) algorithm: we learn a mapping from the instrument to the treatment in the first stage, and learn the structural function in the second stage as the mapping from the conditional expectation of the treatment given the instrument (obtained from stage 1) to the outcome. Originally, 2SLS assumes linear relationships in both stages, but this has recently been extended to nonlinear settings. One approach is to use nonlinear feature maps. Sieve IV performs regression using a dictionary of nonlinear basis functions, which grows as the number of samples increases (Newey and Powell, 2003; Blundell et al., 2007; Chen and Pouzo, 2012; Chen and Christensen, 2018). Kernel IV (KIV) (Singh et al., 2019) and Dual IV regression (Muandet et al., 2020) use different (and potentially infinite) dictionaries of basis functions from reproducing kernel Hilbert spaces (RKHSs). Although these methods enjoy desirable theoretical properties, the flexibility of the model is limited, since all existing work uses pre-specified features. Another approach is to perform the stage 1 regression through conditional density estimation (Carrasco et al., 2007; Darolles et al., 2011; Hartford et al., 2017). One advantage of this approach is that it allows for flexible models, including deep neural nets, as proposed in the DeepIV algorithm of Hartford et al. (2017). It is known, however, that conditional density estimation is costly and often suffers from high variance when the treatment is high-dimensional. More recently, Bennett et al. (2019) have proposed DeepGMM, a method inspired by the optimally weighted Generalized Method of Moments (GMM) (Hansen, 1982), which finds a structural function such that the regression residual and the instrument are independent. Although this approach can handle high-dimensional treatment variables and deep NNs as feature extractors, the learning procedure may not be as stable as the 2SLS approach, since it involves solving a smooth zero-sum game, as when training Generative Adversarial Networks (Wiatrak et al., 2019).

In this paper, we propose Deep Feature Instrumental Variable Regression (DFIV), which aims to combine the advantages of all previous approaches while avoiding their limitations. In DFIV, we use deep neural nets to adaptively learn feature maps in the 2SLS approach, which allows us to fit highly nonlinear structural functions, as in DeepGMM and DeepIV. Unlike DeepIV, DFIV does not rely on conditional density estimation. Like sieve IV and KIV, DFIV learns the conditional expectation of the feature maps in stage 1 and uses the predicted features in stage 2, but with the additional advantage of learned features. We empirically show that DFIV performs better than other methods on several IV benchmarks, and apply DFIV successfully to off-policy policy evaluation, a fundamental problem in reinforcement learning (RL).

The paper is structured as follows. In Section 2, we formulate the IV regression problem and introduce two-stage least squares regression. In Section 3, we give a detailed description of our DFIV method. We demonstrate the empirical performance of DFIV in Section 4, covering three settings: a classical demand prediction example from econometrics, a challenging IV setting where the treatment consists of high-dimensional image data, and the problem of off-policy policy evaluation in reinforcement learning.
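To make the classical linear 2SLS procedure concrete, the following NumPy sketch is a minimal toy illustration (the simulated data-generating process, coefficients, and variable names are our own, not taken from any of the cited works). It shows how regressing the outcome on the stage-1 prediction of the treatment given the instrument removes the confounding bias that ordinary least squares suffers from.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy structural model: unobserved confounder e affects both treatment x
# and outcome y; instrument z affects y only through x.
# True structural effect of x on y is 2.0.
e = rng.normal(size=n)                               # unobserved confounder
z = rng.normal(size=n)                               # instrument
x = 0.8 * z + e + 0.1 * rng.normal(size=n)           # treatment (confounded)
y = 2.0 * x - 3.0 * e + 0.1 * rng.normal(size=n)     # outcome

# Naive OLS of y on x is biased, since x is correlated with e.
X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: linear regression from instrument to treatment,
# giving the prediction x_hat ~ E[x | z].
Z = np.column_stack([np.ones(n), z])
gamma = np.linalg.lstsq(Z, x, rcond=None)[0]
x_hat = Z @ gamma

# Stage 2: linear regression from the predicted treatment to the outcome.
X_hat = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print("OLS slope: ", beta_ols[1])    # biased away from the true value 2.0
print("2SLS slope:", beta_2sls[1])   # close to the true value 2.0
```

In this linear setting the columns of `Z` and `X` play the role of the feature maps on instrument and treatment; sieve IV, KIV, and DFIV replace them with nonlinear basis functions (fixed dictionaries, kernel features, or learned neural-net features, respectively) while keeping the same two-stage structure.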

