EFFICIENT CONDITIONALLY INVARIANT REPRESENTATION LEARNING

Abstract

We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features φ(X) of data X to estimate a target Y, while being conditionally independent of a distractor Z given Y. Both Z and Y are assumed to be continuous-valued but relatively low-dimensional, whereas X and its features may be complex and high-dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from Y to kernelized features of Z, which can be done in advance. It is then only necessary to enforce independence of φ(X) from the residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if φ(X) ⊥⊥ Z | Y. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.

We begin by providing a general-purpose characterization of conditional independence. We then introduce CIRCE, a conditional independence criterion based on this characterization, which is zero if and only if conditional independence holds (under certain required conditions). We provide a finite-sample estimate with convergence guarantees, and strategies for efficient estimation from data.

* Equal contribution. † Code for image data experiments is available at github.com/namratadeka/circe

1. INTRODUCTION

We consider a learning setting where we have labels Y that we would like to predict from features X, and we additionally observe some metadata Z to which we would like our prediction to be 'invariant'. In particular, our aim is to learn a representation function φ for the features such that φ(X) ⊥⊥ Z | Y. There are at least three motivating settings where this task arises.

1. Fairness. In this context, Z is a protected attribute (e.g., race or sex), and the condition φ(X) ⊥⊥ Z | Y is the equalized odds condition (Mehrabi et al., 2021).
2. Domain-invariant learning. In this case, Z is a label for the environment in which the data was collected (e.g., if we collect data from multiple hospitals, Z_i labels the hospital that the i-th datapoint is from). The condition φ(X) ⊥⊥ Z | Y is sometimes used as a target for invariant learning (e.g., Long et al., 2018; Tachet des Combes et al., 2020; Goel et al., 2021; Jiang & Veitch, 2022). Wang & Veitch (2022) argue that this condition is well-motivated in cases where Y causes X.
3. Causal representation learning. Neural networks may learn undesirable "shortcuts" for their tasks, e.g., classifying images based on the texture of the background. To mitigate this issue, various schemes have been proposed to force the network to use causally relevant factors in its decisions (e.g., Veitch et al., 2021; Makar et al., 2022; Puli et al., 2022). The structural causal assumptions used in such approaches imply conditional independence relationships between the features we would like the network to use and observed metadata to which we may wish to be invariant. These approaches then try to learn causally structured representations by enforcing this conditional independence in a learned representation.

In this paper, we are largely agnostic to the motivating application, instead concerning ourselves with how to learn a representation φ that satisfies the target condition.
Our interest is in the (common) case where X is high-dimensional structured data, e.g., text, images, or video, and we would like to model the relationship between X and the (relatively low-dimensional) Y, Z using a neural network representation φ(X). There are a number of existing techniques for learning conditionally invariant representations using neural networks (in all of the motivating applications mentioned above). Usually, however, they rely on the labels Y being categorical with a small number of categories. We develop a method for conditionally invariant representation learning that is effective even when the labels Y and attributes Z are continuous or moderately high-dimensional.

To understand the challenge, it is helpful to contrast with the task of learning a representation φ satisfying the marginal independence φ(X) ⊥⊥ Z. To accomplish this, we might define a neural network to predict Y in the usual manner, interpret the penultimate layer as the representation φ, and then add a regularization term that penalizes some measure of dependence between φ(X) and Z. As φ changes at each step, we would typically compute an estimate based on the samples in each mini-batch (e.g., Beutel et al., 2019; Veitch et al., 2021). The challenge in extending this procedure to conditional invariance is that conditional dependence is considerably harder to measure. More precisely, since conditioning on Y "splits" the available data,¹ we require large samples to assess conditional independence. When regularizing neural network training, however, we only have the samples available in each mini-batch: often not enough for a reliable estimate. The main contribution of this paper is a technique that reduces the problem of learning a conditionally independent representation to the problem of learning a marginally independent representation, following a characterization of conditional independence due to Daudin (1980).
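As a hedged illustration of the marginal-independence recipe above (our own sketch, not the paper's code; the Gaussian kernel, the bandwidth of 1.0, and the biased HSIC estimator are standard choices we assume here), one might penalize the empirical Hilbert-Schmidt Independence Criterion between the mini-batch representations φ(X) and the attributes Z:

```python
import numpy as np

def rbf_gram(A, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix over the rows of A."""
    sq = np.sum(A**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def hsic_penalty(phi_x, z, bandwidth=1.0):
    """Biased empirical HSIC between features phi(X) and attributes Z."""
    n = phi_x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K = rbf_gram(phi_x, bandwidth)
    L = rbf_gram(z, bandwidth)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 1))
dependent = hsic_penalty(np.tanh(3.0 * z), z)            # phi(X) depends on Z
independent = hsic_penalty(rng.normal(size=(256, 1)), z)  # phi(X) independent of Z
print(dependent, independent)  # dependence typically yields the larger penalty
```

In a training loop, the Gram matrices would be recomputed on each mini-batch and differentiated through φ; the bandwidth is often set by the median heuristic.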
We first construct a particular statistic ζ(Y, Z) such that enforcing the marginal independence φ(X) ⊥⊥ ζ(Y, Z) is (approximately) equivalent to enforcing φ(X) ⊥⊥ Z | Y. The construction is straightforward: given a fixed feature map ψ(Y, Z) on Y × Z (which may be a kernel or random Fourier feature map), we define ζ(Y, Z) as the conditionally centered features, ζ(Y, Z) = ψ(Y, Z) − E[ψ(Y, Z) | Y]. We obtain a measure of conditional independence, the Conditional Independence Regression CovariancE (CIRCE), as the Hilbert-Schmidt norm of the kernel covariance between φ(X) and ζ(Y, Z). A key point is that the conditional feature mean E[ψ(Y, Z) | Y] can be estimated offline, in advance of any neural network training, using standard methods (Song et al., 2009; Grunewalder et al., 2012; Park & Muandet, 2020; Li et al., 2022). This makes CIRCE a suitable regularizer for any setting where the conditional independence relation φ(X) ⊥⊥ Z | Y should be enforced when learning φ(X). In particular, the learned relationship between Z and Y does not depend on the mini-batch size, sidestepping the tension between small mini-batches and the need for large samples to estimate conditional dependence.

Our paper proceeds as follows: in Section 2, we introduce the relevant characterization of conditional independence due to Daudin (1980), followed by our CIRCE criterion; we establish that CIRCE is indeed a measure of conditional independence, and provide a consistent empirical estimate with finite-sample guarantees. Next, in Section 3, we review alternative measures of conditional dependence. Finally, in Section 4, we demonstrate CIRCE in two practical settings: a series of counterfactual invariance benchmarks due to Quinzan et al. (2022), and image data extraction tasks on which a "cheat" variable is observed during training.

¹ If Y is categorical, naively we would measure a marginal independence for each level of Y.
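This construction can be sketched numerically. The following is a toy illustration under our own assumptions (not the paper's implementation): random Fourier features stand in for the kernel feature map ψ, a ridge regression from random features of Y estimates the conditional mean E[ψ(Y, Z) | Y] offline, and the statistic is the squared Frobenius (Hilbert-Schmidt) norm of the empirical cross-covariance between φ(X) and the residuals ζ:

```python
import numpy as np

def rff(A, W, b):
    """Random Fourier features approximating a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(A @ W + b)

def circe_stat(phi_x, zeta):
    """Squared Hilbert-Schmidt (Frobenius) norm of the empirical
    cross-covariance between phi(X) and the residual features zeta."""
    n = phi_x.shape[0]
    cov = (phi_x - phi_x.mean(0)).T @ (zeta - zeta.mean(0)) / n
    return float(np.sum(cov**2))

rng = np.random.default_rng(0)
n, d = 2048, 128
y = rng.normal(size=(n, 1))
z = y + rng.normal(size=(n, 1))              # Z depends on Y plus noise

# Fixed feature map psi(Y, Z) on the joint space.
W = rng.normal(size=(2, d)); b = rng.uniform(0, 2 * np.pi, size=d)
psi = rff(np.hstack([y, z]), W, b)

# Offline ridge regression from features of Y to psi: estimates E[psi | Y].
Wy = rng.normal(size=(1, d)); by = rng.uniform(0, 2 * np.pi, size=d)
F = rff(y, Wy, by)
alpha = np.linalg.solve(F.T @ F + 1e-3 * np.eye(d), F.T @ psi)
zeta = psi - F @ alpha                       # conditionally centered features

phi_ok = y + rng.normal(size=(n, 1))         # phi(X) independent of Z given Y
phi_bad = z.copy()                           # phi(X) leaks Z beyond Y
print("conditionally invariant:", circe_stat(phi_ok, zeta))
print("leaks Z given Y:", circe_stat(phi_bad, zeta))
# the Z-leaking features typically receive a larger penalty
```

Because ζ is fixed once the regression is done, only the small cross-covariance with the current mini-batch features needs recomputing during training.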

