FUNDAMENTAL LIMITS AND TRADEOFFS IN INVARIANT REPRESENTATION LEARNING

Abstract

Many machine learning applications involve learning representations that achieve two competing goals: to maximize information or accuracy with respect to a target while simultaneously maximizing invariance or independence with respect to a subset of features. Typical examples include privacy-preserving learning, domain adaptation, and algorithmic fairness, just to name a few. In fact, all of the above problems admit a common minimax game-theoretic formulation, whose equilibrium represents a fundamental tradeoff between accuracy and invariance. In this paper, we provide an information-theoretic analysis of this general and important problem under both classification and regression settings. In both cases, we analyze the inherent tradeoffs between accuracy and invariance by providing a geometric characterization of the feasible region in the information plane, and we connect the geometric properties of this feasible region to the fundamental limitations of the tradeoff problem. In the regression setting, we also derive a tight lower bound on the Lagrangian objective that quantifies the tradeoff between accuracy and invariance. Our results shed new light on this fundamental problem by characterizing the interplay between accuracy and invariance, and may be useful in guiding the design of adversarial representation learning algorithms.

1. INTRODUCTION

One of the fundamental tasks in both supervised and unsupervised learning is to learn proper representations of data for various downstream tasks. Due to recent advances in deep learning, there has been a surge of interest in learning so-called invariant representations. Roughly speaking, the underlying problem of invariant representation learning is to find a feature transformation of the data that balances two goals simultaneously. First, the features should preserve enough information with respect to the target task of interest, e.g., to allow good predictive accuracy. Second, the representations should be invariant to changes in a pre-defined attribute, e.g., in visual perception the representations should be invariant to changes of perspective or lighting conditions. Clearly, in general there is often a tension between these two competing goals of error minimization and invariance maximization, and understanding the fundamental limits and tradeoffs therein remains an important open problem. In practice, the problem of learning invariant representations is often formulated as a minimax sequential game between two agents, a feature encoder and an adversary. Under this framework, the goal of the feature encoder is to learn representations that confuse a worst-case adversary attempting to discriminate the pre-defined attribute. Meanwhile, the representations given by the feature encoder should be amenable to a follow-up predictor of the target task. In this paper, we consider the situation where both the adversary and the predictor have infinite capacity, so that the tradeoff between accuracy and invariance depends solely on the representations given by the feature encoder. In particular, our results shed light on the best possible tradeoff attainable by any algorithm.
This leads to a Lagrangian objective with a tradeoff parameter between these two competing goals, and we study the fundamental limitations of this tradeoff by analyzing the extremal values of this Lagrangian in both classification and regression settings. Our results shed new light on the fundamental tradeoff between accuracy and invariance, and give a crisp characterization of how the dependence between the target task and the pre-defined attribute affects the limits of representation learning.

Contributions. We geometrically characterize the tradeoff between accuracy and invariance via an information plane (Shwartz-Ziv & Tishby, 2017) analysis under both classification and regression settings, where each feature transformation corresponds to a point on the information plane. For the classification setting, we provide a fundamental characterization of the feasible region in the information plane, including its boundedness, convexity, and extremal vertices. For the regression setting, we provide an analogous characterization of the feasible region by replacing mutual information with conditional variances. Finally, in the regression setting, we prove a tight information-theoretic lower bound on a Lagrangian objective that trades off accuracy and invariance. The proof relies on an interesting SDP relaxation, which may be of independent interest.

Related Work

There are abundant applications of learning invariant representations in various downstream tasks, including domain adaptation (Ben-David et al., 2007; 2010; Ganin et al., 2016; Zhao et al., 2018), algorithmic fairness (Edwards & Storkey, 2015; Zemel et al., 2013; Zhang et al., 2018; Zhao et al., 2019b), privacy-preserving learning (Hamm, 2015; 2017; Coavoux et al., 2018; Xiao et al., 2019), invariant visual representations (Quiroga et al., 2005; Gens & Domingos, 2014; Bouvrie et al., 2009; Mallat, 2012; Anselmi et al., 2016), and causal inference (Johansson et al., 2016; Shalit et al., 2017; Johansson et al., 2020), just to name a few. To the best of our knowledge, no previous work has studied the particular tradeoff problem considered in this paper. Closest to our work are results in domain adaptation (Zhao et al., 2019a) and algorithmic fairness (Menon & Williamson, 2018; Zhao & Gordon, 2019), which show lower bounds on the classification accuracy over two groups, e.g., source vs. target in domain adaptation and majority vs. minority in algorithmic fairness. Compared to these previous results, our work directly characterizes the tradeoff between accuracy and invariance using information-theoretic concepts in both classification and regression settings. Furthermore, we also give an approximation to the Pareto frontier between accuracy and invariance in both cases.

2. BACKGROUND AND PRELIMINARIES

Notation We adopt the usual setup: given (X, Y) ∈ X × Y, where Y is the response and X ∈ R^p represents the input vector, we seek a classification/regression function f(X) that minimizes E[ℓ(f(X), Y)], where ℓ: Y × Y → R is some loss function depending on the context of the underlying problem. In this paper, we consider two typical choices of ℓ: (1) the cross-entropy loss, i.e., ℓ(y, y′) = −y log(y′) − (1 − y) log(1 − y′), which is typically used when Y is a discrete variable in classification; (2) the squared loss, i.e., ℓ(y, y′) = (y − y′)², which is suitable for continuous Y, as in regression. Throughout the paper, we assume that all random variables have finite second-order moments.

Problem Setup Apart from the input/output pairs, in our setting there is a third variable A, which corresponds to a variable that a predictor should be invariant to. Depending on the particular application, A could correspond to potential protected attributes in algorithmic fairness, e.g., the ethnicity or gender of an individual; or A could be the domain index in domain adaptation, etc. In general, we assume that there is a joint distribution D over the triple (X, A, Y), from which our observational data are sampled. Upon receiving the data, the goal of the learner is twofold. On one hand, the learner aims to accurately predict the target Y. On the other hand, it also tries to be insensitive to variation in A. To achieve this dual goal, one standard approach in the literature (Zemel et al., 2013; Edwards & Storkey, 2015; Hamm, 2015; Ganin et al., 2016; Zhao et al., 2018) is through the lens of representation learning. Specifically, let Z = g(X), where g(·) is a (possibly randomized) transformation function that takes X as input and gives the corresponding feature encoding Z. The hope is that, by learning the transformation function g(·), Z contains as much information as possible about the target Y while at the same time filtering out information related to A.
This problem is often phrased as an adversarial game:

min_{f,g} max_{f′} E_D[ℓ(f ∘ g(X), Y)] − λ · E_D[ℓ(f′ ∘ g(X), A)],    (1)

where the two competing agents are the feature transformation g (together with the predictor f) and the adversary f′, and λ > 0 is a tradeoff hyperparameter between the task variable Y and the attribute A. For example, the adversary f′ could be understood as a domain discriminator in applications related to domain adaptation, or as an auditor of the sensitive attribute in algorithmic fairness. In the above minimax game, the first term corresponds to the accuracy of the target task, and the second term is the loss incurred by the adversary. It is worth pointing out that the minimax problem in (1) is separable for any fixed feature transformation g, in the sense that once g has been fixed, the optimizations of f and f′ are independent of each other. Formally, define R*_Y(g) := inf_f E_D[ℓ(f(g(X)), Y)] to be the optimal risk in predicting Y from Z = g(X) under the loss ℓ, and similarly define R*_A(g). The separable structure of the problem leads to the following compact form:

OPT(λ) := min_g R*_Y(g) − λ · R*_A(g).    (2)

The minimization here is taken over a family of (possibly randomized) transformations g. Intuitively, (2) characterizes the situation where, for a given transformation Z = g(X), both f and f′ play their optimal responses. Hence this objective function characterizes a fundamental limit on the best possible representation we can hope to achieve for a fixed value of λ. In general, with 0 < λ < ∞, there is an inherent tension between the minimization of R*_Y(g) and the maximization of R*_A(g), and a choice of the tradeoff hyperparameter λ essentially corresponds to a realization of this tradeoff.

Motivating Examples We discuss several examples to which the above framework is applicable.

Example 2.1 (Privacy-Preservation).
In privacy applications, the goal is to make it difficult to predict sensitive data, represented by the attribute A, while retaining information about Y (Hamm, 2015; 2017; Coavoux et al., 2018; Xiao et al., 2019). A way to achieve this is to pass information through Z, the "privatized" or "sanitized" data.

Example 2.2 (Algorithmic Fairness). In fairness applications, we seek to make predictions about the response Y without discriminating based on the information contained in the protected attributes A. For example, A may represent a protected class of individuals defined by, e.g., race or gender. This definition of fairness is also known as statistical parity in the literature, and has recently received increasing attention from an information-theoretic perspective (McNamara et al., 2019; Zhao & Gordon, 2019; Dutta et al., 2019).

Example 2.3 (Domain Adaptation). In domain adaptation, our goal is to train a predictor using labeled data from the source domain that generalizes to the target domain. In this case, A corresponds to the identity of domains, and the hope is to learn a domain-invariant representation Z that is informative about the target Y (Ben-David et al., 2007; 2010; Ganin et al., 2016; Zhao et al., 2018).

Example 2.4 (Group Invariance). In many applications in computer vision, it is desirable to learn predictors that are invariant to the action of a group G on the input space. Typical examples include rotation, translation, and scale. By considering random variables A that take their values in G, one approach to this problem is to learn a representation Z that "ignores" changes in A (Quiroga et al., 2005; Gens & Domingos, 2014; Bouvrie et al., 2009; Mallat, 2012; Anselmi et al., 2016).

Example 2.5 (Information bottleneck). The information bottleneck (Tishby et al., 2000) is the problem of finding a representation Z that maximizes the objective I(Y; Z) − λ · I(X; Z), compressing X while preserving information about Y.
This is closely related to, but not the same as the problem we study, owing to the invariant attribute A.
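To make the compact form OPT(λ) concrete, here is a minimal sketch under the cross-entropy loss, where R*_Y(g) = H(Y | Z) and R*_A(g) = H(A | Z) once the predictor and adversary play their optimal responses (as derived in Appendix A). The toy distribution and the restriction to deterministic encoders on a four-point input space are assumptions for illustration, not part of the paper:

```python
import itertools
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed noiseless toy setup: X uniform on {0,1,2,3}, Y = X mod 2, A = X // 2 (A is
# independent of Y here), so f*_Y and f*_A of Assumption 4.1 exist by construction.
xs = np.arange(4)
px = np.full(4, 0.25)
Yx = xs % 2
Ax = xs // 2

def cond_entropy(Zx, Tx):
    """H(T | Z) for a deterministic encoder Z = g(X) and target T = t(X)."""
    joint = np.zeros((4, 2))
    for x in xs:
        joint[Zx[x], Tx[x]] += px[x]
    return H(joint) - H(joint.sum(axis=1))

# Brute-force OPT(lambda) over all 4^4 deterministic encoders g: X -> {0,1,2,3}.
lam = 1.0
best = min(cond_entropy(np.array(g), Yx) - lam * cond_entropy(np.array(g), Ax)
           for g in itertools.product(range(4), repeat=4))
# Since A is independent of Y, Z = Y achieves H(Y|Z) = 0 while H(A|Z) = H(A) = 1,
# so the optimum here is -1.
```

Over randomized encoders the optimum can only improve, so this enumeration gives an upper bound on OPT(λ) in general; in this independent toy case it is already optimal.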

3. FEASIBLE REGION ON THE INFORMATION PLANE

We begin by defining the feasible region associated with the adversarial game (1) and discussing its relevance to the problem we study. Formally, we define the information plane to be the 2D coordinate plane with axes −R*_Y(g) and −R*_A(g), respectively. The feasible region then corresponds to the two-dimensional region traced out by the pairs (−R*_Y(g), −R*_A(g)) over all possible representations Z = g(X) on the information plane. More concretely, in the classification and regression settings, these pairs can be given a more intuitive interpretation in terms of mutual information and conditional variances, respectively. In particular, it is easy to show the following:

1. (Classification) Under the cross-entropy loss, using standard information-theoretic identities, the adversarial game (2) can be rewritten as

min_{Z=g(X)} H(Y | Z) − λ · H(A | Z) ⟺ max_{Z=g(X)} I(Y; Z) − λ · I(A; Z).    (3)

2. (Regression) Under the least-squares loss, by the law of total variance, the adversarial game (2) can be rewritten as

min_{Z=g(X)} E[Var(Y | Z)] − λ · E[Var(A | Z)] ⟺ max_{Z=g(X)} Var(E[Y | Z]) − λ · Var(E[A | Z]).    (4)

These equivalences motivate the following definitions:

(Classification): R_CE := {(I(Y; Z), I(A; Z)) ∈ R² : Z = g(X)},
(Regression): R_LS := {(Var(E[Y | Z]), Var(E[A | Z])) ∈ R² : Z = g(X)}.

[Figure 1: 2D information plane. (a) Information plane of classification, with axes I(Y; Z) and I(A; Z) bounded by H(Y) and H(A). (b) Information plane of regression, with axes Var(E[Y | Z]) and Var(E[A | Z]) bounded by Var(Y) and Var(A). The shaded area corresponds to the feasible region; the marked points are Z ≡ c, Z = X, E*_A, and E*_Y.]

We call R_CE and R_LS the feasible regions for the classification and regression settings, respectively. See Fig. 1 for an illustration of the information planes and the feasible regions. At this point, it may not be immediately clear what the relevance of the feasible region is. To see this, recall that our high-level goal is to find representations Z that maximize accuracy (i.e., minimize E_D[ℓ(f(Z), Y)]) while simultaneously maximizing invariance (i.e., maximize E_D[ℓ(f′(Z), A)]), and consider the four vertices (not necessarily part of the feasible region) in Fig. 1. These four corners have intuitive interpretations:

• (Red) The so-called "informationless" regime, in which all of the information regarding both Y and A is destroyed. This is achieved by choosing a constant representation Z ≡ c.
• (Yellow) Here, we retain all of the information in A while removing all of the information about Y. This is not a particularly interesting regime for the aforementioned applications.
• (Blue) The full information regime, where Z = X and no information is lost. This is the "classical" setting, wherein information about A is allowed to leak into Y.
• (Green) This is the ideal representation that we would like to attain: we preserve all the relevant information about Y while simultaneously removing all the information about A.

Unfortunately, in general, the ideal representation may not be attainable due to the potential correlation between Y and A. As a result, we are interested in characterizing how "close" we can get to this ideal transformation given the distribution over (X, A, Y). More precisely, we can describe the various extremal points on the boundary of the feasible region as follows:

• E*_Y: This point corresponds to a representation Z that maximizes accuracy subject to a hard constraint on the invariance (cf. (5), (10)), i.e., there is no leakage of information about A into the representation Z. In classification, we enforce this via the mutual information constraint I(A; Z) = 0, and in regression via the conditional variance constraint Var(E[A | Z]) = 0.
• E*_A: This point corresponds to a representation Z that maximizes invariance subject to a hard constraint on the accuracy (cf. (8), (12)), i.e., there is no loss of information about Y in the representation Z. In classification, we enforce this via the mutual information constraint I(Y; Z) = H(Y), and in regression via the conditional variance constraint Var(E[Y | Z]) = Var(Y).
• As we vary λ ∈ (0, ∞), we carve out a path OPT(λ) between E*_Y and E*_A that corresponds to the optimal values of (1). This is the Pareto frontier of the accuracy-invariance tradeoff, and represents the best possible tradeoff attainable for a given λ.

Due to the symmetry between Y and A in (2), the feasible regions in both cases are symmetric with respect to the diagonal of the bounding rectangle. With the feasible region clearly exposed, we can now concretely outline our objective: to analytically characterize the solutions to the extremal problems corresponding to the points E*_Y and E*_A on the boundary, and to provide lower bounds on the objective OPT(λ). Due to the page limit, we defer all detailed proofs to the appendix and focus on providing interpretations and insights in the main text.
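The bounding rectangle of the feasible region can be checked empirically. The sketch below (an assumed four-point toy distribution with correlated Y and A, not from the paper) samples random stochastic encoders p(z | x) and verifies that every resulting point (I(Y; Z), I(A; Z)) lies inside [0, H(Y)] × [0, H(A)]:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(joint):
    """Mutual information (bits) from a 2D joint probability table."""
    return H(joint.sum(1)) + H(joint.sum(0)) - H(joint)

# Assumed noiseless toy setup: X uniform on 4 states, Y and A deterministic and correlated.
px = np.full(4, 0.25)
Yx = np.array([0, 0, 1, 1])          # Y = f*_Y(X)
Ax = np.array([0, 1, 1, 1])          # A = f*_A(X)
HY = H(np.array([0.5, 0.5]))         # = 1 bit
HA = H(np.array([0.25, 0.75]))

pts = []
for _ in range(2000):
    W = rng.dirichlet(np.ones(3), size=4)     # random channel p(z | x) with 3 values of Z
    pzy = np.zeros((3, 2))
    pza = np.zeros((3, 2))
    for x in range(4):
        pzy[:, Yx[x]] += px[x] * W[x]
        pza[:, Ax[x]] += px[x] * W[x]
    pts.append((mi(pzy), mi(pza)))
pts = np.array(pts)

# Every achievable pair lies in the bounding rectangle [0, H(Y)] x [0, H(A)].
assert np.all(pts >= -1e-9)
assert np.all(pts[:, 0] <= HY + 1e-9)
assert np.all(pts[:, 1] <= HA + 1e-9)
```

Plotting `pts` scatters an inner approximation of the feasible region of Fig. 1a; the data processing inequality is what confines it to the rectangle.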

4. CLASSIFICATION

In order to understand the tradeoff between these two competing goals, it is most instructive to study the case where the original input X contains full information for predicting both Y and A, so that any loss of accuracy is not due to a noninformative input X. To this end, our following analysis focuses on the noiseless setting:

Assumption 4.1. There exist functions f*_Y(·) and f*_A(·) such that Y = f*_Y(X) and A = f*_A(X).

In order to characterize the feasible region R_CE, first note that by the data processing inequality, the following inequalities hold: 0 ≤ I(Y; Z) ≤ I(Y; X) = H(Y) and 0 ≤ I(A; Z) ≤ I(A; X) = H(A), which means that for any transformation Z = g(X), the point (I(Y; Z), I(A; Z)) must lie within the rectangle shown in Fig. 2a. The following lemma shows that the feasible region R_CE is convex:

Lemma 4.1. R_CE is convex.

Here, the convexity of R_CE is guaranteed by a construction of a randomized feature transformation. As we briefly discussed before, two vertices of the bounding rectangle are attainable, i.e., the "informationless" origin and the "full information" diagonal vertex. Now, with Lemma 4.1, it is clear that all the points on the diagonal of the bounding rectangle are also attainable.
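The randomized construction behind Lemma 4.1 can be verified exactly on a discrete example: appending the selector bit S to the representation makes the mutual information of the mixture exactly the convex combination of the endpoints. The encoders and distribution below are assumptions for illustration:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(j):
    """Mutual information (bits) from a 2D joint probability table."""
    return H(j.sum(1)) + H(j.sum(0)) - H(j)

# Assumed toy distribution: X uniform on 4 states, Y deterministic in X.
px = np.full(4, 0.25)
Yx = np.array([0, 0, 1, 1])
g0 = np.array([0, 1, 0, 1])   # hypothetical encoder with I(Y; Z0) = 0
g1 = np.array([0, 0, 1, 1])   # hypothetical encoder with I(Y; Z1) = H(Y)

def joint(g):
    """Joint table of (Z, Y) for a deterministic encoder g."""
    j = np.zeros((2, 2))
    for x in range(4):
        j[g[x], Yx[x]] += px[x]
    return j

u = 0.3
j0, j1 = joint(g0), joint(g1)
# Composite representation Z' = (Z_S, S), S ~ Bernoulli(u) independent of (X, Y):
# its joint with Y is block-structured over the composite alphabet (z, s).
jmix = np.concatenate([u * j0, (1 - u) * j1], axis=0)
# I(Y; Z') = u * I(Y; Z0) + (1 - u) * I(Y; Z1), as in the proof of Lemma 4.1.
assert abs(mi(jmix) - (u * mi(j0) + (1 - u) * mi(j1))) < 1e-9
```

The same computation applied to A in place of Y gives the second coordinate, so the mixed point lies on the segment between the two original points of R_CE.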

4.1. MAXIMAL MUTUAL INFORMATION UNDER THE INDEPENDENCE CONSTRAINT

In this section we explore the extremal point E*_Y. That is, we would like to maximize the mutual information between Z and Y while keeping Z independent of A:

max_Z I(Y; Z), subject to I(A; Z) = 0.    (5)

First, observe that the optimal solution of (5) clearly depends on the coupling between A and Y. To see this, consider the following two extreme cases:

Example 4.1. If A = Y almost surely, then I(A; Z) = 0 directly implies I(Y; Z) = 0; hence max_Z I(Y; Z) = 0 under the constraint that I(A; Z) = 0.

Example 4.2. If A ⊥ Y, then Z = f*_Y(X) = Y satisfies the constraint I(A; Z) = I(A; Y) = 0. Furthermore, I(Y; Z) = I(Y; Y) = H(Y) ≥ I(Y; Z′) for any Z′. Hence max_Z I(Y; Z) = H(Y).

The above two examples show that the optimal solution of (5), if it admits an analytic form, must involve a quantity that characterizes the dependency between A and Y. We first define such a quantity:

Definition 4.1. Define ∆_{Y|A} := |Pr_D(Y = 1 | A = 0) − Pr_D(Y = 1 | A = 1)|.

It is easy to verify the following claims about ∆_{Y|A}: 0 ≤ ∆_{Y|A} ≤ 1; ∆_{Y|A} = 0 ⟺ A ⊥ Y; and ∆_{Y|A} = 1 ⟺ A = Y or A = 1 − Y.

With this notation, the following theorem gives an analytic solution to (5):

Theorem 4.1. The optimal value of optimization problem (5) is max_{Z: I(A;Z)=0} I(Y; Z) = H(Y) − ∆_{Y|A} · H(A).

Let us sanity check this result. First, if A ⊥ Y, then ∆_{Y|A} = 0, and the optimal value given by Theorem 4.1 reduces to H(Y) − 0 · H(A) = H(Y), consistent with Example 4.2. On the other hand, if A = Y almost surely, then ∆_{Y|A} = 1 and H(A) = H(Y), so the optimal value reduces to H(Y) − 1 · H(A) = 0, consistent with Example 4.1. Moreover, due to the symmetry between A and Y, we can now characterize the locations of the two extremal points on the lower and left boundaries. The updated figure is plotted in Fig. 2b, where ∆_{A|Y} is defined analogously to ∆_{Y|A} by swapping Y and A.
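Theorem 4.1 can be checked numerically using the explicit construction from its proof (Appendix B.2): Z ~ U(0, 1) is drawn independently of A, and Y is determined by which of the three intervals [0, α], (α, β], (β, 1] contains Z. Coarsening Z to these intervals loses nothing, so the computation below is exact; the particular values of q, α, β are assumptions:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(j):
    """Mutual information (bits) from a 2D joint probability table."""
    return H(j.sum(1)) + H(j.sum(0)) - H(j)

q, alpha, beta = 0.4, 0.2, 0.7               # P(A=1), P(Y=1|A=0), P(Y=1|A=1), alpha <= beta
pa = np.array([1 - q, q])
pz = np.array([alpha, beta - alpha, 1 - beta])   # Z ~ U(0,1) coarsened into 3 intervals

pzay = np.zeros((3, 2, 2))
for z in range(3):
    for a in range(2):
        # Y = 1 on [0, alpha]; Y = A on (alpha, beta]; Y = 0 on (beta, 1].
        y = 1 if z == 0 else (a if z == 1 else 0)
        pzay[z, a, y] = pz[z] * pa[a]        # Z independent of A by construction

I_AZ = mi(pzay.sum(axis=2))                  # joint of (Z, A)
I_YZ = mi(pzay.sum(axis=1))                  # joint of (Z, Y)
py1 = (1 - q) * alpha + q * beta
HY = H(np.array([1 - py1, py1]))
delta = beta - alpha                         # = Delta_{Y|A}

assert abs(I_AZ) < 1e-12                     # the independence constraint holds
assert abs(I_YZ - (HY - delta * H(pa))) < 1e-9   # the value matches Theorem 4.1
```

Varying (q, α, β) traces out exactly the boundary line with slope tan(θ) = ∆_{Y|A} shown in Fig. 2b.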

4.2. MINIMUM MUTUAL INFORMATION UNDER THE SUFFICIENT STATISTICS CONSTRAINT

Next, we characterize the other extremal point, E*_A. Again, by the symmetry between A and Y, it suffices to solve the following optimization problem, whose optimal solution is E*_A:

min_Z I(A; Z), subject to I(Y; Z) = H(Y).    (8)

[Figure 2: The feasible region in classification. (a) Rectangle bounding box with vertices Z ≡ c and Z = X. (b) Maximal I(·;·) under the independence constraint: the offsets from the ideal corner are ∆_{Y|A} · H(A) and ∆_{A|Y} · H(Y), with tan(θ) = ∆_{Y|A} and tan(φ) = ∆_{A|Y}. (c) Minimum I(·;·) under the sufficient-statistics constraint: the offsets are I(A; Y), with tan(θ′) = I(A; Y)/H(Y) and tan(φ′) = I(A; Y)/H(A). (d) The convex polygon characterization of R_CE, combining (b) and (c).]

As we show in the appendix, the optimal value of (8) equals I(A; Y). Clearly, if A and Y are independent, then the gap I(A; Y) = 0, meaning that we can simultaneously preserve all the target-related information and filter out all the information related to A. With the above result, we can now characterize the locations of the remaining two extremal points on the top and right boundaries of the bounding rectangle. The updated figure is shown in Fig. 2c.
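The attainability direction of this result is immediate to check: Z = Y meets the sufficiency constraint I(Y; Z) = H(Y) and leaks exactly I(A; Y) bits about A. The toy joint of (A, Y) below is an assumption for illustration:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(j):
    """Mutual information (bits) from a 2D joint probability table."""
    return H(j.sum(1)) + H(j.sum(0)) - H(j)

# Assumed toy joint of (A, Y) with dependence; rows index A, columns index Y.
pay = np.array([[0.4, 0.1],
                [0.2, 0.3]])
py = pay.sum(axis=0)
HA = H(pay.sum(axis=1))

# Z = Y satisfies the sufficiency constraint I(Y; Z) = H(Y) ...
pzy = np.diag(py)                 # joint of (Z, Y) when Z = Y
assert abs(mi(pzy) - H(py)) < 1e-12
# ... and leaks exactly I(A; Y) bits about A, the minimum possible leakage.
leak = mi(pay.T)                  # joint of (Z, A) when Z = Y
assert abs(leak - mi(pay)) < 1e-12
assert 0 < leak < HA              # unavoidable, but strictly below full leakage H(A)
```

By contrast, the full-information representation Z = X would leak H(A) bits, so the gap H(A) − I(A; Y) is what an optimal sufficient representation saves over no invariance at all.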

4.3. THE INFORMATION PLANE IN LEARNING REPRESENTATIONS Z

To get the full picture, we combine our results in Section 4.1 and Section 4.2 and use the fact that R_CE must be convex (Lemma 4.1). This allows us to complete the analysis by connecting the black dots on the four boundaries of the bounding rectangle, as shown in Fig. 2d: the feasible region R_CE is a convex polygon. Furthermore, both the constrained accuracy-optimal solution and the constrained invariance-optimal solution can be readily read off from Fig. 2d as well. As we mentioned before, ideally we would like to find a representation Z that attains the green vertex of the bounding rectangle. Unfortunately, due to the potential coupling between Y and A, this solution is not always feasible. Nevertheless, it is instructive to examine the gaps between the optimal solutions we could hope to achieve and the ideal one:

• Maximal information: the gap is given by ∆_{Y|A} · H(A), which vanishes when A ⊥ Y.
• Minimal leakage: the gap is given by I(A; Y), which again vanishes when A ⊥ Y.

One open question that we do not answer here is whether the feasible region R_CE is strictly convex, that is, whether the Pareto frontier between E*_Y and E*_A is strictly convex. On the other hand, for each value of λ, the line segment connecting E*_Y and E*_A forms a lower bound on the Lagrangian. For the aforementioned applications, our approximation of the frontier is critical in order to be able to certify that a given model is not optimal. For example, given some practically computed representation Z, and using known optimal estimators of the mutual information, it is possible to estimate I(A; Y), I(Y; Z), and I(A; Z) in order to directly bound the (sub)optimality of Z using Fig. 2d, i.e., how far the point (I(Y; Z), I(A; Z)) is from the line segment between E*_Y and E*_A. This distance lower bounds the distance to the optimal representations on the Pareto frontier.

[Figure 3: The feasible region in regression. (a) Rectangle bounding box with vertices Z ≡ c and Z = X. (b) Maximal variance under the invariance constraint, with gap (2⟨y, Σa⟩/⟨a, a⟩ − Var(A) · ⟨a, y⟩/⟨a, a⟩²) · ⟨a, y⟩. (c) Minimum variance under the sufficiency constraint, with offsets Var(Y) · ⟨a, y⟩²/⟨y, y⟩² and Var(A) · ⟨y, a⟩²/⟨a, a⟩². (d) The combined characterization of R_LS.]

5. REGRESSION

Similar to what we have before, in regression we assume a noiseless setting for better interpretability of our results; the generalization to the noisy setting is included in the appendix. Let H be an RKHS.

Assumption 5.1. There exist functions f*_Y, f*_A ∈ H such that Y = f*_Y(X) and A = f*_A(X).

Let ⟨·, ·⟩ be the canonical inner product in the RKHS H. Under this assumption, there exist a feature map ϕ(X) and a ≠ 0, y ≠ 0 such that Y = f*_Y(X) = ⟨ϕ(X), y⟩ and A = f*_A(X) = ⟨ϕ(X), a⟩. This feature map does not have to be finite-dimensional, and our analysis covers the case where f*_Y, f*_A are infinite-dimensional. Next, by the law of total variance, the following inequalities hold: 0 ≤ Var(E[Y | Z]) ≤ Var(E[Y | X]) = Var(Y) and 0 ≤ Var(E[A | Z]) ≤ Var(E[A | X]) = Var(A), which means that for any transformation Z = g(X), the point (Var(E[Y | Z]), Var(E[A | Z])) must lie within the rectangle shown in Fig. 3a. To simplify the notation, we define Σ := Cov(ϕ(X), ϕ(X)) to be the covariance operator of ϕ(X). Again, if we consider all possible feature transformations Z = g(X), then the points (Var(E[Y | Z]), Var(E[A | Z])) form a feasible region R_LS. Similar to what we have in classification, the following lemma shows that the feasible region R_LS is convex:

Lemma 5.1. R_LS is convex.

The convexity of R_LS is again guaranteed by a construction of a randomized feature transformation. Similarly, both the "informationless" origin and the "full information" diagonal vertex are attainable.

Under review as a conference paper at ICLR 2021

5.1. THE BOUNDING VERTICES ON THE PLANE

In this section we explore the extremal points E*_Y and E*_A for regression. For E*_Y, we would like to maximize the explained variance of Y while forcing that of A to zero:

max_Z Var(E[Y | Z]), subject to Var(E[A | Z]) = 0.    (10)

It is clear that the optimal solution of (10) depends on the coupling between A and Y, and the following theorem precisely characterizes this relationship:

Theorem 5.1. The optimal value of optimization problem (10) is upper bounded by

max_{Z: Var(E[A|Z])=0} Var(E[Y | Z]) ≤ Var(Y) − (2⟨y, Σa⟩/⟨a, a⟩ − Var(A) · ⟨a, y⟩/⟨a, a⟩²) · ⟨a, y⟩.

Again, let us sanity check this result. First, if a is orthogonal to y, i.e., ⟨a, y⟩ = 0, then the gap is 0, and the upper bound becomes Var(Y). Next, consider the other extreme case where a is parallel to y. In this case it can be readily verified that the upper bound reduces to 0. With these two results, we can now characterize the locations of the two extremal points on the bottom and left boundaries of the bounding rectangle. The updated figure is plotted in Fig. 3b. Similarly, for E*_A, it suffices to solve the following problem, whose optimal solution is E*_A:

min_Z Var(E[A | Z]), subject to Var(E[Y | Z]) = Var(Y).    (12)

Theorem 5.2. The optimal value of optimization problem (12) is lower bounded by

min_{Z: Var(E[Y|Z])=Var(Y)} Var(E[A | Z]) ≥ Var(Y) · ⟨a, y⟩² / ⟨y, y⟩².

Again, if a is orthogonal to y, then the lower bound is 0, meaning that we can simultaneously preserve all the target variance and filter out all the variance related to A. On the other hand, if a is parallel to y, then Var(Y) · ⟨a, y⟩² / ⟨y, y⟩² = Var(A). The updated plot is shown in Fig. 3c.
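The sanity checks for Theorems 5.1 and 5.2 can be run numerically in a finite-dimensional sketch where ϕ(X) is the identity map, so the RKHS inner products become plain dot products and Σ is an ordinary covariance matrix. The random Σ, y, a below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
Sigma = M @ M.T                      # assumed feature covariance Cov(phi(X)), PSD

def ub_EY(y, a):
    """Theorem 5.1 upper bound on Var E[Y|Z] at the extremal point E*_Y."""
    var_y = y @ Sigma @ y            # Var(Y) = <y, Sigma y>
    var_a = a @ Sigma @ a            # Var(A) = <a, Sigma a>
    gap = (2 * (y @ Sigma @ a) / (a @ a)
           - var_a * (a @ y) / (a @ a) ** 2) * (a @ y)
    return var_y - gap

def lb_EA(y, a):
    """Theorem 5.2 lower bound on Var E[A|Z] at the extremal point E*_A."""
    var_y = y @ Sigma @ y
    return var_y * (a @ y) ** 2 / (y @ y) ** 2

y = rng.standard_normal(d)

# Orthogonal case <a, y> = 0: no tension, both bounds reach the ideal corner.
a = rng.standard_normal(d)
a -= (a @ y) / (y @ y) * y
assert abs(ub_EY(y, a) - y @ Sigma @ y) < 1e-9   # upper bound = Var(Y)
assert abs(lb_EA(y, a)) < 1e-12                  # lower bound = 0

# Parallel case a = c * y: A carries exactly the same signal as Y.
c = 0.7
assert abs(ub_EY(y, c * y)) < 1e-9                                   # upper bound = 0
assert abs(lb_EA(y, c * y) - (c * y) @ Sigma @ (c * y)) < 1e-9       # = Var(A)
```

Between these two extremes the bounds interpolate, tracing the slanted boundary segments of Fig. 3b and Fig. 3c.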

5.2. A SPECTRAL LOWER BOUND OF THE LAGRANGIAN

Combining our results in Thm. 5.1 and Thm. 5.2, along with the fact that R_LS must be convex, we plot the full picture of the feasible region in the regression setting in Fig. 3d. Both the constrained accuracy-optimal solution and the constrained invariance-optimal solution can be readily read off from Fig. 3d as well. In fact, in the regression setting we can say even more: we can derive a tight lower bound on the Lagrangian problem

OPT(λ) := min_{Z=g(X)} E[Var(Y | Z)] − λ · E[Var(A | Z)].

Theorem 5.3. The optimal solution of the Lagrangian has the following lower bound:

OPT(λ) ≥ (1/2) [ Var(Y) − λ · Var(A) − √((Var(Y) + λ · Var(A))² − 4λ⟨y, Σa⟩²) ].    (14)

Evidently, the key quantity in the lower bound (14) is the quadratic term ⟨y, Σa⟩, which effectively measures the dependence between Y and A under the feature covariance Σ. The proofs of Thm. 5.1, 5.2 and 5.3 rely on a finite-dimensional SDP relaxation, and construct an explicit optimal solution to this relaxation. We reformulate the objective as a linear functional of V := Cov(E[ϕ(X) | Z], E[ϕ(X) | Z]), which satisfies the semidefinite constraint 0 ⪯ V ⪯ Σ = Cov(ϕ(X), ϕ(X)). Therefore, the optimal value of the SDP is an upper/lower bound on the objective. Furthermore, we show that under certain regularity conditions, the SDP relaxation is exact. One particularly interesting setting where the regularity condition holds is when ϕ(X) follows a Gaussian distribution. More discussion of the tightness of our bounds is presented in Appendix C.5.
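Two consistency properties of the bound (14) are easy to verify numerically in a finite-dimensional sketch (random Σ, y, a are assumptions): the discriminant under the square root is nonnegative because ⟨y, Σa⟩² ≤ Var(Y)·Var(A) by Cauchy-Schwarz in the Σ-inner product, and the bound never exceeds the Lagrangian value of the two trivial representations, Z = X (value 0 in the noiseless setting) and Z ≡ c (value Var(Y) − λ·Var(A)):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
Sigma = M @ M.T                      # assumed feature covariance, PSD
y, a = rng.standard_normal(d), rng.standard_normal(d)

var_y = y @ Sigma @ y                # Var(Y)
var_a = a @ Sigma @ a                # Var(A)
c = y @ Sigma @ a                    # <y, Sigma a>, the key dependence term

def lower_bound(lam):
    """Right-hand side of Theorem 5.3."""
    disc = (var_y + lam * var_a) ** 2 - 4 * lam * c ** 2
    return 0.5 * (var_y - lam * var_a - np.sqrt(disc))

for lam in [0.1, 1.0, 10.0]:
    # Discriminant >= 0 by Cauchy-Schwarz: <y, Sigma a>^2 <= Var(Y) * Var(A),
    # so (VY + lam VA)^2 - 4 lam c^2 >= (VY - lam VA)^2 >= 0.
    assert (var_y + lam * var_a) ** 2 - 4 * lam * c ** 2 >= 0
    # A valid lower bound cannot exceed any feasible value of the objective:
    # Z = X gives 0, Z = const gives Var(Y) - lam * Var(A).
    assert lower_bound(lam) <= min(0.0, var_y - lam * var_a) + 1e-9
```

When ⟨y, Σa⟩ = 0 the bound collapses to −λ·Var(A), which is attained by a representation that keeps Y perfectly while explaining nothing of A, matching the orthogonal case of Section 5.1.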

6. CONCLUSION

We provide an information plane analysis to study the general and important problem of learning invariant representations in both classification and regression settings. In both cases, we analyze the inherent tradeoffs between accuracy and invariance by providing a geometric characterization of the feasible region on the information plane, in terms of its boundedness, convexity, and extremal vertices. Furthermore, in the regression setting, we also derive a tight lower bound for the Lagrangian that trades off accuracy and invariance. Given the wide applications of invariant representations in machine learning, we believe our theoretical results could contribute to a better understanding of the fundamental tradeoffs between accuracy and invariance under various settings, e.g., domain adaptation, algorithmic fairness, invariant visual representations, and privacy-preserving learning.

A PROOFS FOR CLAIMS IN SECTION 3

In this section we give detailed arguments deriving the objective functions of Eq. (3) and Eq. (4) from the original minimax formulation in Eq. (1).

Classification Given a fixed feature map Z = g(X), due to the symmetry between Y and A in Eq. (1), it suffices for us to consider the case of finding f that minimizes E_D[ℓ(f ∘ g(X), Y)]; the analogous result for the adversary f′ minimizing E_D[ℓ(f′ ∘ g(X), A)] follows similarly. By the definition of the cross-entropy loss, we have:

E_D[ℓ(f ∘ g(X), Y)] = −E_D[1{Y = 0} log(1 − f(g(X))) + 1{Y = 1} log(f(g(X)))]
= −E_D[1{Y = 0} log(1 − f(Z)) + 1{Y = 1} log(f(Z))]
= −E_Z E_Y[1{Y = 0} log(1 − f(Z)) + 1{Y = 1} log(f(Z)) | Z]
= −E_Z[Pr(Y = 0 | Z) log(1 − f(Z)) + Pr(Y = 1 | Z) log(f(Z))]
= E_Z[D_KL(Pr(Y | Z) ∥ f(Z))] + H(Y | Z)
≥ H(Y | Z).

It is also clear from the above derivation that the minimum value of the cross-entropy loss is achieved when f(Z) = Pr(Y = 1 | Z). This shows that

min_f E_D[ℓ(f ∘ g(X), Y)] = H(Y | Z),   min_{f′} E_D[ℓ(f′ ∘ g(X), A)] = H(A | Z).

To see the second part of Eq. (3), simply use the identities H(Y | Z) = H(Y) − I(Y; Z) and H(A | Z) = H(A) − I(A; Z), together with the fact that both H(Y) and H(A) are constants that only depend on the joint distribution D.

Regression Again, given a fixed feature map Z = g(X), because of the symmetry between Y and A, let us focus on the analysis of finding f that minimizes E_D[ℓ(f ∘ g(X), Y)]. In this case, since ℓ(·, ·) is the squared loss, it follows that

E_D[ℓ(f ∘ g(X), Y)] = E_D[(f ∘ g(X) − Y)²] = E_D[(f(Z) − Y)²]
= E_Z[(f(Z) − E[Y | Z])²] + E_Z E_Y[(Y − E[Y | Z])² | Z]
≥ E_Z E_Y[(Y − E[Y | Z])² | Z] = E[Var(Y | Z)],

where the second equality is due to the Pythagorean theorem. Furthermore, it is clear that the optimal mean squared error is attained by the conditional mean f(Z) = E[Y | Z].
This shows that min_f E_D[ℓ(f ∘ g(X), Y)] = E[Var(Y | Z)] and min_{f′} E_D[ℓ(f′ ∘ g(X), A)] = E[Var(A | Z)]. For the second part, use the law of total variance: Var(Y) = E[Var(Y | Z)] + Var(E[Y | Z]) and Var(A) = E[Var(A | Z)] + Var(E[A | Z]). Noting that both Var(Y) and Var(A) are constants that only depend on the joint distribution D, we finish the proof.
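Both optimality claims above can be checked numerically. The sketch below (the toy distributions are assumptions, not from the paper) verifies that f(Z) = Pr(Y = 1 | Z) attains the cross-entropy optimum H(Y | Z) (in nats, matching the natural-log loss), and that the conditional mean attains the mean-squared-error optimum E[Var(Y | Z)]:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- classification: min_f E[CE(f(Z), Y)] = H(Y | Z), attained at f(Z) = P(Y=1 | Z) ---
pz = np.array([0.5, 0.3, 0.2])            # assumed marginal of a 3-valued Z
eta = np.array([0.9, 0.5, 0.2])           # assumed P(Y = 1 | Z = z)

def ce_risk(f):
    """E[-Y log f(Z) - (1 - Y) log(1 - f(Z))], in nats."""
    return np.sum(pz * -(eta * np.log(f) + (1 - eta) * np.log(1 - f)))

H_Y_given_Z = np.sum(pz * -(eta * np.log(eta) + (1 - eta) * np.log(1 - eta)))
assert abs(ce_risk(eta) - H_Y_given_Z) < 1e-12
for _ in range(100):                      # any other predictor pays an extra KL term
    assert ce_risk(rng.uniform(0.01, 0.99, size=3)) >= H_Y_given_Z - 1e-12

# --- regression: min_f E[(f(Z) - Y)^2] = E[Var(Y | Z)], attained at f(Z) = E[Y | Z] ---
n = 200_000
Z = rng.integers(0, 3, size=n)
mu = np.array([0.0, 2.0, 5.0])            # assumed E[Y | Z]
sd = np.array([1.0, 0.5, 2.0])            # assumed sqrt(Var(Y | Z))
Y = mu[Z] + sd[Z] * rng.standard_normal(n)

mse_opt = np.mean((mu[Z] - Y) ** 2)       # conditional-mean predictor
assert abs(mse_opt - np.mean(sd[Z] ** 2)) < 0.05          # ~ E[Var(Y | Z)]
assert np.mean((mu[Z] + 0.3 - Y) ** 2) > mse_opt          # a shifted predictor is worse
# law of total variance: Var(Y) = E[Var(Y|Z)] + Var(E[Y|Z])
assert abs(np.var(Y) - (np.mean(sd[Z] ** 2) + np.var(mu[Z]))) < 0.1
```

The classification check is exact up to floating point; the regression check is Monte Carlo, so its tolerances reflect sampling error at n = 200,000.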

B MISSING PROOFS IN CLASSIFICATION (SECTION 4)

In what follows we first restate the propositions, lemmas and theorems in the main text, and then provide the corresponding proofs.

B.1 CONVEXITY OF R_CE

Lemma 4.1. R_CE is convex.

Proof. Let Z_i = g_i(X) for i ∈ {0, 1}, with corresponding points (I(Y; Z_i), I(A; Z_i)) ∈ R_CE. We only need to prove that for every u ∈ [0, 1], (uI(Y; Z_0) + (1 − u)I(Y; Z_1), uI(A; Z_0) + (1 − u)I(A; Z_1)) ∈ R_CE as well. For any u ∈ [0, 1], let S ∼ U(0, 1), the uniform distribution over (0, 1), such that S ⊥ (Y, A). Consider the following randomized transformation Z:

Z = Z_0 if S ≤ u, and Z = Z_1 otherwise.

To compute I(Y; Z), we have:

I(Y; Z) = E[I(Y; Z | S)] = Pr(S ≤ u) · I(Y; Z_0) + Pr(S > u) · I(Y; Z_1) = uI(Y; Z_0) + (1 − u)I(Y; Z_1).

A similar argument shows that I(A; Z) = uI(A; Z_0) + (1 − u)I(A; Z_1). So by construction we have found a randomized transformation Z = g(X) such that (uI(Y; Z_0) + (1 − u)I(Y; Z_1), uI(A; Z_0) + (1 − u)I(A; Z_1)) ∈ R_CE.

B.2 PROOF OF THEOREM 4.1

We proceed to prove that the optimal value of (5) is the one given by Theorem 4.1.

Theorem 4.1. The optimal value of optimization problem (5) is max_{Z: I(A;Z)=0} I(Y; Z) = H(Y) − ∆_{Y|A} · H(A).

Proof. For a joint distribution D over (X, A, Y) and a function g: X → Z, in what follows we use g♯D to denote the induced (pushforward) distribution of D under g over (Z, A, Y). We first make the following claim: without loss of generality, for any joint distribution g♯D over (Z, A, Y), we can find (Z_0, A′, Y′) ∼ g♯D and a deterministic function f such that Y′ = f(A′, Z_0, S), where S ∼ U(0, 1), S ⊥ (A′, Z_0), and I(Y′; Z′) ≥ I(Y; Z) with Z′ = (Z_0, S). To see this, consider the following construction: (A′, Z_0) ∼ D(A, Z), S ∼ U(0, 1). Let (a, z, s) be a sample from the above sampling process, and construct

Y′ = 1 if s ≤ E[Y | A = a, Z = z], and Y′ = 0 otherwise.

Now it is easy to verify that (Z_0, A′, Y′) ∼ g♯D and Pr(Y′ = 1 | A′ = a, Z_0 = z) = E[Y | A = a, Z = z].
To see the last claim, the following inequality holds:
$$I(Y'; Z') = I(Y'; Z_0', S) \ge I(Y'; Z_0') = I(Y; Z).$$
Now, to upper bound $I(Y; Z)$, we have $I(Y; Z) = H(Y) - H(Y \mid Z)$, hence it suffices to lower bound $H(Y \mid Z)$. To this end, define
$$D_0 := \{(z, \varepsilon) : \varepsilon \in (0, 1),\; f(0, z, \varepsilon) = 1\}, \qquad D_1 := \{(z, \varepsilon) : \varepsilon \in (0, 1),\; f(1, z, \varepsilon) = 1\}.$$
Then
$$\Pr((z, \varepsilon) \in D_0) = \Pr(f(0, z, \varepsilon) = 1) = \Pr(f(0, z, \varepsilon) = 1 \mid A = 0) = \Pr(f(A, z, \varepsilon) = 1 \mid A = 0) = \Pr(Y = 1 \mid A = 0),$$
where the second equality uses $(Z, S) \perp A$. Analogously,
$$\Pr((z, \varepsilon) \in D_1) = \Pr(Y = 1 \mid A = 1).$$
Without loss of generality, assume that $\Pr(Y = 1 \mid A = 1) \ge \Pr(Y = 1 \mid A = 0)$; then
$$\Pr((z, \varepsilon) \in D_1 \setminus D_0) \ge \Pr((z, \varepsilon) \in D_1) - \Pr((z, \varepsilon) \in D_0) = |\Pr(Y = 1 \mid A = 1) - \Pr(Y = 1 \mid A = 0)|.$$
On the other hand, we know that if $(z, \varepsilon) \in D_1 \setminus D_0$, then $f(1, z, \varepsilon) = 1$ and $f(0, z, \varepsilon) = 0$, and this implies $Y = A$. Hence:
$$\begin{aligned}
H(Y \mid Z) \ge H(Y \mid Z, S) &= \Pr((z, \varepsilon) \in D_1 \setminus D_0) \cdot H(Y \mid (z, \varepsilon) \in D_1 \setminus D_0) + \Pr((z, \varepsilon) \notin D_1 \setminus D_0) \cdot H(Y \mid (z, \varepsilon) \notin D_1 \setminus D_0) \\
&\ge \Pr((z, \varepsilon) \in D_1 \setminus D_0) \cdot H(Y \mid (z, \varepsilon) \in D_1 \setminus D_0) \\
&= \Pr((z, \varepsilon) \in D_1 \setminus D_0) \cdot H(A) \\
&\ge |\Pr(Y = 1 \mid A = 1) - \Pr(Y = 1 \mid A = 0)| \cdot H(A),
\end{aligned}$$
which implies
$$I(Y; Z) \le H(Y) - |\Pr(Y = 1 \mid A = 1) - \Pr(Y = 1 \mid A = 0)| \cdot H(A) = H(Y) - \Delta_{Y|A} \cdot H(A).$$
To see that the upper bound can be attained, consider the following construction. Denote $\alpha := \Pr(Y = 1 \mid A = 0)$ and $\beta := \Pr(Y = 1 \mid A = 1)$. Draw $Z \sim U(0, 1)$ and then sample $A$ independently of $Z$ according to the corresponding marginal distribution of $A$ in $\mathcal{D}$. Next, define:
$$Y = \begin{cases} 1 & \text{if } (Z \le \alpha \wedge A = 0) \text{ or } (Z \le \beta \wedge A = 1), \\ 0 & \text{otherwise.} \end{cases}$$
It is easy to see that $Z \perp A$ by construction, so $I(A; Z) = 0$. Furthermore, by the construction of $Y$, we also have $(A, Y) \sim \mathcal{D}(A, Y)$. Since $I(Y; Z) = H(Y) - H(Y \mid Z)$, we only need to verify that $H(Y \mid Z) = \Delta_{Y|A} \cdot H(A)$.
• $\alpha < Z \le \beta$: In this case $Y = A$, hence the conditional distribution of $Y$ given $Z \in (\alpha, \beta]$ is equal to the conditional distribution of $A$ given $Z \in (\alpha, \beta]$.
But by our construction, $A$ is independent of $Z$, which means that in this case the conditional distribution of $A$ given $Z \in (\alpha, \beta]$ is just the marginal distribution of $A$. Combining all three cases, we have:
$$\begin{aligned}
H(Y \mid Z) &= \Pr(Z \le \alpha) \cdot H(Y \mid Z \le \alpha) + \Pr(Z > \beta) \cdot H(Y \mid Z > \beta) + \Pr(\alpha < Z \le \beta) \cdot H(Y \mid \alpha < Z \le \beta) \\
&= 0 + 0 + |\beta - \alpha| \cdot H(A \mid \alpha < Z \le \beta) \\
&= |\Pr(Y = 1 \mid A = 1) - \Pr(Y = 1 \mid A = 0)| \cdot H(A) \\
&= \Delta_{Y|A} \cdot H(A),
\end{aligned}$$
which completes the proof of Theorem 4.1.

For the proof of Theorem 4.2 (Appendix B.3 below), under its constraint $I(Y; Z) = H(Y)$ and using the observation $H(Y \mid Z, A) = 0$ made there, we have the following chain:
$$\begin{aligned}
I(A; Z) &= H(Z) - H(Z \mid A) \\
&= H(Z) - H(Y, Z \mid A) && (H(Y \mid Z, A) = 0) \\
&= H(Z) - H(Y \mid A) - H(Z \mid Y, A) \\
&\ge H(Z) - H(Y \mid A) - H(Z \mid Y) \\
&= I(Y; Z) - H(Y \mid A) \\
&= H(Y) - H(Y \mid A) = I(A; Y).
\end{aligned}$$
To attain the equality, simply set $Z = f^*_Y(X) = Y$. In particular, this implies that one bit is sufficient to encode all the information at the optimal solution, which completes the proof.
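The achievability construction of Theorem 4.1 can be sanity-checked numerically. The sketch below (a toy check, not from the paper) bins $Z \sim U(0,1)$ into the three intervals used in the case analysis, which discards no information relevant to $(A, Y)$, and computes the mutual informations exactly from the joint pmf; the values of `p_a1`, `alpha`, `beta` are arbitrary choices.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_info(joint):
    """I(U;V) in bits from a 2-D joint pmf."""
    pu = joint.sum(axis=1)
    pv = joint.sum(axis=0)
    return entropy(pu) + entropy(pv) - entropy(joint.ravel())

# Hypothetical marginals: Pr(A=1), alpha = Pr(Y=1|A=0), beta = Pr(Y=1|A=1).
p_a1, alpha, beta = 0.4, 0.3, 0.8          # alpha <= beta, as in the proof

# Bin Z ~ U(0,1) into the three intervals of the case analysis.
p_z = np.array([alpha, beta - alpha, 1.0 - beta])   # Z in {0,1,2}, Z independent of A
p_a = np.array([1.0 - p_a1, p_a1])

# Y = 1 iff (Z <= alpha) or (alpha < Z <= beta and A = 1).
joint_azy = np.zeros((2, 3, 2))            # indices: (a, z, y)
for a in range(2):
    for z in range(3):
        y = 1 if (z == 0 or (z == 1 and a == 1)) else 0
        joint_azy[a, z, y] = p_a[a] * p_z[z]

joint_ay = joint_azy.sum(axis=1)
joint_az = joint_azy.sum(axis=2)
joint_zy = joint_azy.sum(axis=0)

H_Y = entropy(joint_ay.sum(axis=0))
H_A = entropy(p_a)
delta = abs(beta - alpha)                  # Delta_{Y|A}

# The construction reproduces the conditional marginals of Y given A ...
assert np.isclose(joint_ay[0, 1] / p_a[0], alpha)
assert np.isclose(joint_ay[1, 1] / p_a[1], beta)
# ... is exactly invariant, and attains the bound of Theorem 4.1.
assert np.isclose(mutual_info(joint_az), 0.0)
assert np.isclose(mutual_info(joint_zy), H_Y - delta * H_A)
```

Because all quantities are computed exactly from the finite joint pmf, the check verifies the identity $I(Y;Z) = H(Y) - \Delta_{Y|A} \cdot H(A)$ up to floating-point error only.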

C MISSING PROOFS IN REGRESSION (SECTION 5)

C.1 CONVEXITY OF $\mathcal{R}_{\mathrm{LS}}$

Analogous to the classification setting, we first show that the feasible region $\mathcal{R}_{\mathrm{LS}}$ is convex:

Lemma 5.1. $\mathcal{R}_{\mathrm{LS}}$ is convex.

Proof. Let $Z_i = g_i(X)$ for $i \in \{0, 1\}$, with corresponding points $(\mathrm{Var}\,\mathbb{E}[Y \mid Z_i], \mathrm{Var}\,\mathbb{E}[A \mid Z_i]) \in \mathcal{R}_{\mathrm{LS}}$. It suffices to show that for every $u \in [0, 1]$,
$$\big(u\,\mathrm{Var}\,\mathbb{E}[Y \mid Z_0] + (1-u)\,\mathrm{Var}\,\mathbb{E}[Y \mid Z_1],\; u\,\mathrm{Var}\,\mathbb{E}[A \mid Z_0] + (1-u)\,\mathrm{Var}\,\mathbb{E}[A \mid Z_1]\big) \in \mathcal{R}_{\mathrm{LS}}$$
as well. We give a constructive proof. Due to the symmetry between $A$ and $Y$, we only prove the result for $Y$; the same analysis applies directly to $A$ as well. Fix $u \in [0, 1]$ and let $U \sim U(0, 1)$, the uniform distribution over $(0, 1)$, such that $U \perp (Y, A)$. Consider the following randomized transformation $Z$:
$$Z = \begin{cases} Z_0 & \text{if } U \le u, \\ Z_1 & \text{otherwise.} \end{cases}$$
To compute $\mathrm{Var}\,\mathbb{E}[Y \mid Z]$, define $K := \mathbb{E}[Y \mid Z]$; then by the law of total variance we have:
$$\mathrm{Var}\,\mathbb{E}[Y \mid Z] = \mathrm{Var}(K) = \mathbb{E}[\mathrm{Var}(K \mid U)] + \mathrm{Var}\,\mathbb{E}[K \mid U].$$
We first compute $\mathrm{Var}\,\mathbb{E}[K \mid U]$:
$$\mathrm{Var}\,\mathbb{E}[K \mid U] = \mathrm{Var}\,\mathbb{E}[\mathbb{E}[Y \mid Z] \mid U] = \mathrm{Var}\,\mathbb{E}[Y \mid U] \;\; \text{(law of total expectation)} = \mathrm{Var}\,\mathbb{E}[Y] \;\; (Y \perp U) = 0.$$
On the other hand, for $\mathbb{E}[\mathrm{Var}(K \mid U)]$, we have:
$$\mathbb{E}[\mathrm{Var}(K \mid U)] = \Pr(U \le u) \cdot \mathrm{Var}(K \mid U \le u) + \Pr(U > u) \cdot \mathrm{Var}(K \mid U > u) = u \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_0] + (1-u) \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_1].$$

Combining both equations above yields:

$$\mathrm{Var}\,\mathbb{E}[Y \mid Z] = u \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_0] + (1-u) \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_1].$$

A similar argument shows that $\mathrm{Var}\,\mathbb{E}[A \mid Z] = u \cdot \mathrm{Var}\,\mathbb{E}[A \mid Z_0] + (1-u) \cdot \mathrm{Var}\,\mathbb{E}[A \mid Z_1]$. Hence the randomized transformation $Z = g(X)$ realizes the point $(u \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_0] + (1-u) \cdot \mathrm{Var}\,\mathbb{E}[Y \mid Z_1],\; u \cdot \mathrm{Var}\,\mathbb{E}[A \mid Z_0] + (1-u) \cdot \mathrm{Var}\,\mathbb{E}[A \mid Z_1]) \in \mathcal{R}_{\mathrm{LS}}$, which completes the proof.

C.2 PROOF OF THEOREM 5.1 AND THEOREM 5.2

In this section, we prove Theorem 5.1 and Theorem 5.2. We provide proofs of both theorems in a generalized noisy setting, i.e., we no longer assume the noiseless condition, so that the corresponding theorems in the noiseless setting follow as special cases. To this end, we first re-define
$$f^*_Y(X) := \mathbb{E}[Y \mid X], \qquad f^*_A(X) := \mathbb{E}[A \mid X], \tag{17}$$
with $f^*_Y, f^*_A \in \mathcal{H}$. We reuse the notations $a, y$ to denote
$$f^*_Y(X) = \mathbb{E}[Y \mid X] = \langle y, \phi(X) \rangle, \qquad f^*_A(X) = \mathbb{E}[A \mid X] = \langle a, \phi(X) \rangle. \tag{19}$$
It is easy to see that the noiseless setting is indeed the special case where $Y = \mathbb{E}[Y \mid X]$ and $A = \mathbb{E}[A \mid X]$ almost surely. For the reader's convenience, we restate Theorem 5.1 below.

Proof. Using the law of total expectation,
$$\mathbb{E}[Y \mid Z] = \mathbb{E}\big[\mathbb{E}[Y \mid X] \mid Z\big] = \int_{\mathcal{X}} \mathbb{E}[Y \mid X, Z]\, p(X \mid Z)\, dX.$$
Since $Z = g(X)$ is a function of $X$, we have $Z \perp Y \mid X$, so $\mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid X] = f^*_Y(X)$. Therefore,
$$\mathbb{E}[Y \mid Z] = \int_{\mathcal{X}} f^*_Y(X)\, p(X \mid Z)\, dX = \mathbb{E}[f^*_Y(X) \mid Z],$$
and hence $\mathrm{Var}(\mathbb{E}[Y \mid Z]) = \mathrm{Var}(\mathbb{E}[f^*_Y(X) \mid Z])$. It follows that
$$\mathrm{Var}\,\mathbb{E}[Y \mid Z] = \mathrm{Var}\,\mathbb{E}[\langle y, \phi(X) \rangle \mid Z] = \mathrm{Var}\,\langle y, \mathbb{E}[\phi(X) \mid Z] \rangle = \big\langle y, \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])\, y \big\rangle,$$
where the second equality uses the linearity of expectation. Similarly, for $f^*_A(X) = \langle a, \phi(X) \rangle$ we have:
$$\mathrm{Var}\,\mathbb{E}[A \mid Z] = \big\langle a, \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])\, a \big\rangle.$$
To simplify the notation, define $V := \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])$. Then again, by the law of total variance, it is easy to verify that $0 \preceq V \preceq \Sigma = \mathrm{Cov}(\phi(X), \phi(X))$. Hence the original maximization problem can be relaxed as follows:
$$\max_Z\; \langle y, V y \rangle, \quad \text{subject to } 0 \preceq V \preceq \Sigma,\; \langle a, V a \rangle = 0.$$
To proceed, we first decompose $y$ orthogonally w.r.t.
$a$: $y = y_{\perp a} + y_a$, where $y_{\perp a}$ is the component of $y$ perpendicular to $a$ and $y_a$ is the component of $y$ parallel to $a$. Using this orthogonal decomposition, for every feasible $V$:
$$\langle y, V y \rangle = \big\langle y_{\perp a} + y_a, V (y_{\perp a} + y_a) \big\rangle = \langle y_{\perp a}, V y_{\perp a} \rangle \le \langle y_{\perp a}, \Sigma y_{\perp a} \rangle,$$
where the second equality holds because $\langle a, V a \rangle = 0$ implies $V^{1/2} y_a = 0$, and the inequality uses $V \preceq \Sigma$. The equality above can be attained by choosing $V$ so that the corresponding eigenvalues of $V$ along the direction of $y_{\perp a}$ coincide with those of $\Sigma$. Note that this is also feasible since the constraint of eigenvalues being $0$ only applies to the direction $y_a$, which is orthogonal to $y_{\perp a}$.

To prove Theorem 5.2, we decompose $a$ orthogonally w.r.t. $y$: $a = a_{\perp y} + a_y$, where $a_{\perp y}$ is the component of $a$ perpendicular to $y$ and $a_y$ is the component of $a$ parallel to $y$. Using this orthogonal decomposition, for every feasible $V$:
$$\langle a, V a \rangle = \big\langle a_{\perp y} + a_y, V (a_{\perp y} + a_y) \big\rangle \ge \langle a_y, V a_y \rangle \quad (V \succeq 0),$$
where the equality can be attained by choosing $V$ such that $V^{1/2} a_{\perp y} = 0$. On the other hand, it is clear that $a_y = \langle a, y_0 \rangle \cdot y_0$, where $y_0 = y / \|y\|$ is the unit vector along $y$. Plugging $a_y = \langle a, y_0 \rangle \cdot y_0$ into $\langle a_y, V a_y \rangle$, together with the fact that $\langle y, V y \rangle = \mathrm{Var}(\mathbb{E}[Y \mid X]) = \langle y, \Sigma y \rangle$, we get
$$\langle a_y, V a_y \rangle = \mathrm{Var}(\mathbb{E}[Y \mid X]) \cdot \frac{\langle a, y \rangle^2}{\langle y, y \rangle^2}.$$
Again, to attain the equality, we first construct the optimal matrix $V^*$ by eigendecomposing $\Sigma$; specifically, this time we set to $0$ all the eigenvalues of $\Sigma$ whose corresponding eigenvectors are perpendicular to $y$. As argued in the proof of Theorem 5.1, $V^*$ is positive semidefinite but not necessarily invertible. Nevertheless, we can still define the projection matrix onto the column space of $V^*$ as follows:
$$P_{V^*} := V^* (V^{*T} V^*)^{\dagger} V^{*T},$$
where $Q^{\dagger}$ denotes the Moore–Penrose pseudoinverse of a matrix $Q$. With $P_{V^*}$, it is easy to verify that the optimal transformation is given by any $Z$ such that $\mathbb{E}[\phi(X) \mid Z] = P_{V^*} \phi(X)$. To see this, we have:
$$\mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z]) = \mathrm{Var}(P_{V^*} \phi(X)) = P_{V^*}\, \mathrm{Var}(\phi(X))\, P_{V^*}^T = P_{V^*} \Sigma P_{V^*}^T = V^*,$$
completing the proof.
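The randomized-mixture argument used in Lemma 5.1 (and, with mutual information in place of variance, in Lemma 4.1) can be checked exactly on a small discrete example. The toy pmf `p_x`, the conditional means `m`, and the two encoders `g0`, `g1` below are arbitrary choices, not from the paper:

```python
import numpy as np

# A small synthetic joint distribution over (X, Y): p_x is the pmf of X
# and m[x] = E[Y | X = x]; everything here is an arbitrary toy choice.
p_x = np.array([0.1, 0.2, 0.3, 0.4])
m = np.array([0.9, 0.1, 0.6, 0.3])        # conditional means E[Y | X = x]
g0 = np.array([0, 0, 1, 1])               # encoder Z0 = g0(X)
g1 = np.array([0, 1, 0, 1])               # encoder Z1 = g1(X)

def var_cond_mean(g):
    """Var(E[Y | Z]) for the deterministic encoder Z = g(X)."""
    k = np.zeros(len(p_x))                # E[Y | Z = g(x)] laid out over x
    for z in np.unique(g):
        mask = (g == z)
        k[mask] = (p_x[mask] * m[mask]).sum() / p_x[mask].sum()
    ey = (p_x * m).sum()
    return (p_x * (k - ey) ** 2).sum()

def var_cond_mean_mixture(u):
    """Var(E[Y | Z]) for the randomized encoder that applies g0 with
    probability u and g1 otherwise, the branch indicator included in Z."""
    # Joint over (branch, x), since the branch is independent of (X, Y).
    p = np.concatenate([u * p_x, (1 - u) * p_x])
    g = np.concatenate([g0, 10 + g1])     # offset keeps branch labels disjoint
    mm = np.concatenate([m, m])
    k = np.zeros(len(p))
    for z in np.unique(g):
        mask = (g == z)
        k[mask] = (p[mask] * mm[mask]).sum() / p[mask].sum()
    ey = (p * mm).sum()
    return (p * (k - ey) ** 2).sum()

v0, v1 = var_cond_mean(g0), var_cond_mean(g1)
for u in (0.25, 0.5, 0.9):
    # Law of total variance: the cross-branch term vanishes because U ⟂ Y.
    assert np.isclose(var_cond_mean_mixture(u), u * v0 + (1 - u) * v1)
```

All expectations and variances are computed exactly from the finite pmf, so the mixture identity $\mathrm{Var}\,\mathbb{E}[Y \mid Z] = u\,\mathrm{Var}\,\mathbb{E}[Y \mid Z_0] + (1-u)\,\mathrm{Var}\,\mathbb{E}[Y \mid Z_1]$ holds up to floating-point error.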

C.3 PROOF OF THEOREM 5.3

To prove Theorem 5.3, we first introduce the following decompositions of the loss functions. The first lemma is a more refined version of the data-processing inequality, which gives an exact characterization of the Bayes optimality gap for a given $Z$. Recall that the Bayes error is $\mathbb{E}_X[\mathrm{Var}(Y \mid X)]$.

Lemma C.1 ($L_2$ Error Decomposition).
$$\mathbb{E}_Z[\mathrm{Var}(Y \mid Z)] - \mathbb{E}_X[\mathrm{Var}(Y \mid X)] = \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big) \ge 0.$$
Similarly,
$$\mathbb{E}_Z[\mathrm{Var}(A \mid Z)] - \mathbb{E}_X[\mathrm{Var}(A \mid X)] = \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[A \mid X] \mid Z\big) \ge 0.$$

Proof. Since $Z = g(X)$ is a function of $X$, we have $p(y \mid x) = p(y \mid x, z)$, or equivalently, $(Y \perp Z) \mid X$. By the law of total variance,
$$\mathrm{Var}(Y \mid Z) = \mathbb{E}_X[\mathrm{Var}(Y \mid X, Z) \mid Z] + \mathrm{Var}\big(\mathbb{E}[Y \mid X, Z] \mid Z\big) = \mathbb{E}_X[\mathrm{Var}(Y \mid X) \mid Z] + \mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big).$$
Taking expectation over $Z$,
$$\mathbb{E}_Z\,\mathrm{Var}(Y \mid Z) = \mathbb{E}_Z \mathbb{E}_X[\mathrm{Var}(Y \mid X) \mid Z] + \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big) = \mathbb{E}_X\,\mathrm{Var}(Y \mid X) + \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big),$$
where the last equality is due to the law of total expectation.

The following lemma is a direct consequence of the law of total variance.

Lemma C.2 ($L_2$ Invariance Decomposition).
$$\mathrm{Var}(A) - \mathbb{E}_Z\,\mathrm{Var}(A \mid Z) = \mathrm{Var}(\mathbb{E}[A \mid Z]) \ge 0.$$

We will prove a generalized version of Theorem 5.3. Now we substitute (27) and (28) into (26), which gives the following equivalent form of (26):
$$\min\; \langle y, (\Sigma - V) y \rangle + \lambda \langle a, V a \rangle.$$
The key technique of our lower bound is to relax the constraint $V = \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])$ to the semidefinite constraint $\Sigma \succeq V \succeq 0$:
$$\min_{V:\, \Sigma \succeq V \succeq 0}\; \langle y, (\Sigma - V) y \rangle + \lambda \langle a, V a \rangle.$$
This is an SDP whose optimal value lower bounds the objective (26). Moreover, we can show that the SDP optimal solution has a simplified form in terms of eigenvalues and eigenvectors:
$$\langle y, (\Sigma - V) y \rangle + \lambda \langle a, V a \rangle = \langle y, \Sigma y \rangle + \langle V, \lambda a a^T - y y^T \rangle = \langle y, \Sigma y \rangle + \big\langle \Sigma^{-1/2} V \Sigma^{-1/2},\; \Sigma^{1/2} (\lambda a a^T - y y^T) \Sigma^{1/2} \big\rangle.$$
Note that $I \succeq Q := \Sigma^{-1/2} V \Sigma^{-1/2} \succeq 0$, and $R := \Sigma^{1/2} (\lambda a a^T - y y^T) \Sigma^{1/2}$ is a matrix of rank at most $2$. When $R$ is positive semidefinite or negative semidefinite, the minimum is achieved at $Q = 0$ or $Q = I$, respectively.
Otherwise, the only remaining possibility is that $R$ is a rank-$2$ matrix with one positive eigenvalue and one negative eigenvalue. By von Neumann's trace inequality,
$$\langle Q, R \rangle \ge \sum_{i=1}^{d} \sigma_i(R)\, \sigma_{d-i+1}(Q).$$
Since $\sigma_1(R) > 0 = \sigma_2(R) = \cdots = \sigma_{d-1}(R) > \sigma_d(R)$ and $0 \le \sigma_i(Q) \le 1$ for all $i$, we have
$$\langle Q, R \rangle \ge \sigma_d(R) = \sigma_d\big(\Sigma^{1/2} (\lambda a a^T - y y^T) \Sigma^{1/2}\big).$$
The minimizer is $Q = w w^T$, i.e., $V = \Sigma^{1/2} w w^T \Sigma^{1/2}$, where $w$ is the unit eigenvector of $R$ associated with the eigenvalue $\sigma_d(R)$. Combined with the explicit formula for $\sigma_d(R)$ derived in Appendix C.4, this completes the proof.
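The closed form for $\sigma_d(R)$ and the SDP lower bound can be sanity-checked numerically. The sketch below uses arbitrary random choices of $\Sigma$, $a$, $y$, $\lambda$, and, following the linear-encoder construction used later in Theorem C.5, a random linear encoder to produce a feasible $V$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 6, 0.7                            # dimension and tradeoff weight (arbitrary)

B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)                # a generic positive definite covariance
a = rng.standard_normal(d)                 # hypothetical coefficient vectors for
y = rng.standard_normal(d)                 # f*_A and f*_Y in the feature space

# Symmetric square root of Sigma via its eigendecomposition.
w, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(w)) @ U.T

R = Sigma_half @ (lam * np.outer(a, a) - np.outer(y, y)) @ Sigma_half
sigma_min = np.linalg.eigvalsh(R)[0]       # sigma_d(R): R has rank <= 2

aSa = a @ Sigma @ a
ySy = y @ Sigma @ y
aSy = a @ Sigma @ y

# Closed form obtained by solving the quadratic of Appendix C.4.
closed_form = 0.5 * (lam * aSa - ySy
                     - np.sqrt((lam * aSa + ySy) ** 2 - 4 * lam * aSy ** 2))
assert np.isclose(sigma_min, closed_form)

# The relaxation value <y, Sigma y> + sigma_d(R) lower bounds the objective
# <y, (Sigma - V) y> + lam <a, V a> for any feasible V; check it for V induced
# by a random linear encoder Z = L phi(X) under Gaussian phi(X).
L = rng.standard_normal((3, d))
V = Sigma @ L.T @ np.linalg.solve(L @ Sigma @ L.T, L @ Sigma)
objective = y @ (Sigma - V) @ y + lam * (a @ V @ a)
assert objective >= ySy + sigma_min - 1e-9
```

Here $V = \Sigma L^T (L \Sigma L^T)^{-1} L \Sigma$ satisfies $0 \preceq V \preceq \Sigma$ because $\Sigma^{-1/2} V \Sigma^{-1/2}$ is an orthogonal projection, so the SDP lower bound must hold for it.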



Extensions to the general noisy setting are feasible, but the results are less interpretable. Hence we mainly focus on the noiseless setting in this paper.



Figure 2: Information plane in classification. Shaded area corresponds to the known feasible region.

• Maximal accuracy: The gap is given by $\Delta_{Y|A} \cdot H(A)$. On one hand, if $A \perp Y$, then $\Delta_{Y|A} = 0$, so the gap is $0$. On the other hand, if $A = Y$, then $\Delta_{Y|A} = 1$ and $H(A) = H(Y)$, so the gap achieves the maximum value $H(Y)$.
• Maximal invariance: The gap is given by $I(A; Y)$. On one hand, if $A \perp Y$, then $I(A; Y) = 0$, so the gap is $0$. On the other hand, if $A = Y$, then $I(A; Y) = H(A)$, so again the gap achieves the maximum value of $H(A)$.

The convex polygon characterization of $\mathcal{R}_{\mathrm{LS}}$.

Figure 3: Information plane in regression. Shaded area corresponds to the known feasible region.

Assume without loss of generality that $\alpha \le \beta$; there are three different cases depending on the value of $Z$:
• $Z \le \alpha$: In this case, no matter what the value of $A$ is, we always have $Y = 1$, so $H(Y \mid Z \le \alpha) = 0$.
• $Z > \beta$: In this case, no matter what the value of $A$ is, we always have $Y = 0$, so $H(Y \mid Z > \beta) = 0$.

B.3 PROOF OF THEOREM 4.2

Theorem 4.2. The optimal value of optimization problem (8) is
$$\min_{Z:\, I(Y; Z) = H(Y)} I(A; Z) = I(A; Y). \tag{9}$$

Proof. First, realize that $H(Z) \ge I(Y; Z) = H(Y)$ by our constraint. Furthermore, we also know that $0 \le H(Y \mid Z, A) \le H(Y \mid Z) = H(Y) - I(Y; Z) = 0$, which means $H(Y \mid Z, A) = 0$. With these two observations, the chain of (in)equalities given at the end of Appendix B.2 shows that $I(A; Z) \ge I(A; Y)$, with equality attained by $Z = Y$.

The generalized version of Theorem 5.3, without the noiseless assumption, is stated below.

Theorem C.3. The optimal value of the Lagrangian has the following lower bound:
$$\mathrm{OPT}(\lambda) \ge \frac{1}{2}\left(\lambda\,\mathrm{Var}(\mathbb{E}[A \mid X]) + \mathrm{Var}(\mathbb{E}[Y \mid X]) - \sqrt{\big(\lambda\,\mathrm{Var}(\mathbb{E}[A \mid X]) + \mathrm{Var}(\mathbb{E}[Y \mid X])\big)^2 - 4 \lambda \langle a, \Sigma y \rangle^2}\right) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big).$$
When the noiseless assumption holds, we have $\mathrm{Var}(\mathbb{E}[A \mid X]) = \mathrm{Var}(A)$, $\mathrm{Var}(\mathbb{E}[Y \mid X]) = \mathrm{Var}(Y)$, and $\mathbb{E}[\mathrm{Var}(Y \mid X)] = 0$, hence the bound above simplifies to:
$$\frac{1}{2}\left(\mathrm{Var}(Y) - \lambda\,\mathrm{Var}(A) - \sqrt{\big(\lambda\,\mathrm{Var}(A) + \mathrm{Var}(Y)\big)^2 - 4 \lambda \langle a, \Sigma y \rangle^2}\right),$$
which is exactly Theorem 5.3.

Proof of Theorem 5.3. By Lemma C.1 and Lemma C.2, we can decompose the objective as:
$$\begin{aligned}
\mathbb{E}[\mathrm{Var}(Y \mid Z)] - \lambda\,\mathbb{E}[\mathrm{Var}(A \mid Z)] &= \big(\mathbb{E}[\mathrm{Var}(Y \mid Z)] - \mathbb{E}[\mathrm{Var}(Y \mid X)]\big) + \lambda\big(\mathrm{Var}(A) - \mathbb{E}[\mathrm{Var}(A \mid Z)]\big) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big) \\
&= \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big) + \lambda\,\mathrm{Var}(\mathbb{E}[A \mid Z]) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big).
\end{aligned}$$
Since $\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)$ does not depend on $Z$, we focus on the first two terms:
$$\min_{Z = g(X)}\; \mathbb{E}_Z\,\mathrm{Var}\big(\mathbb{E}[Y \mid X] \mid Z\big) + \lambda\,\mathrm{Var}(\mathbb{E}[A \mid Z]). \tag{26}$$
Recall that for the squared loss, $f^*_Y(X) = \mathbb{E}[Y \mid X]$ and $f^*_A(X) = \mathbb{E}[A \mid X]$. We first simplify the objective in (26). We have $\mathbb{E}\,\mathrm{Var}(\mathbb{E}[Y \mid X] \mid Z) = \mathbb{E}\,\mathrm{Var}(f^*_Y(X) \mid Z)$, and
$$\mathbb{E}[A \mid Z] = \int_{\mathcal{X}} \mathbb{E}[A \mid X, Z]\, p(X \mid Z)\, dX.$$
Since $Z = g(X)$ is a function of $X$, we have $Z \perp A \mid X$, so $\mathbb{E}[A \mid X, Z] = \mathbb{E}[A \mid X] = f^*_A(X)$. Therefore,
$$\mathbb{E}[A \mid Z] = \int_{\mathcal{X}} f^*_A(X)\, p(X \mid Z)\, dX = \mathbb{E}[f^*_A(X) \mid Z],$$
and hence $\mathrm{Var}(\mathbb{E}[A \mid Z]) = \mathrm{Var}(\mathbb{E}[f^*_A(X) \mid Z])$.

The problem (26) thus becomes
$$\min_{Z = g(X)}\; \mathbb{E}\,\mathrm{Var}\big(f^*_Y(X) \mid Z\big) + \lambda\,\mathrm{Var}\big(\mathbb{E}[f^*_A(X) \mid Z]\big). \tag{29}$$
In this case, the objective (29) becomes:
$$\begin{aligned}
\mathbb{E}\,\mathrm{Var}\big(f^*_Y(X) \mid Z\big) + \lambda\,\mathrm{Var}\big(\mathbb{E}[f^*_A(X) \mid Z]\big) &= \mathbb{E}\,\mathrm{Var}\big(\langle y, \phi(X) \rangle \mid Z\big) + \lambda\,\mathrm{Var}\big(\mathbb{E}[\langle a, \phi(X) \rangle \mid Z]\big) \\
&= \big\langle y, \mathbb{E}\,\mathrm{Cov}(\phi(X), \phi(X) \mid Z)\, y \big\rangle + \lambda \big\langle a, \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])\, a \big\rangle. \tag{32}
\end{aligned}$$
By the law of total covariance,
$$\mathbb{E}\,\mathrm{Cov}(\phi(X), \phi(X) \mid Z) + \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z]) = \mathrm{Cov}(\phi(X), \phi(X)) = \Sigma.$$
Let $V = \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])$, which satisfies $\Sigma \succeq V \succeq 0$. Then finding the feature transformation $Z = g(X)$ that minimizes (32) is equivalent to:
$$\min_{V = \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z],\, \mathbb{E}[\phi(X) \mid Z])}\; \langle y, (\Sigma - V) y \rangle + \lambda \langle a, V a \rangle.$$

By the lemma in Appendix C.4,
$$\sigma_d(R) = \frac{1}{2}\left(\lambda \langle a, \Sigma a \rangle - \langle y, \Sigma y \rangle - \sqrt{\big(\lambda \langle a, \Sigma a \rangle + \langle y, \Sigma y \rangle\big)^2 - 4 \lambda \langle a, \Sigma y \rangle^2}\right).$$
Therefore,
$$\begin{aligned}
\mathrm{OPT}(\lambda) &= \langle y, (\Sigma - V) y \rangle + \lambda \langle a, V a \rangle + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big) \\
&\ge \langle y, \Sigma y \rangle + \sigma_d(R) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big) \\
&= \frac{1}{2}\left(\lambda \langle a, \Sigma a \rangle + \langle y, \Sigma y \rangle - \sqrt{\big(\lambda \langle a, \Sigma a \rangle + \langle y, \Sigma y \rangle\big)^2 - 4 \lambda \langle a, \Sigma y \rangle^2}\right) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big) \\
&= \frac{1}{2}\left(\lambda\,\mathrm{Var}(\mathbb{E}[A \mid X]) + \mathrm{Var}(\mathbb{E}[Y \mid X]) - \sqrt{\big(\lambda\,\mathrm{Var}(\mathbb{E}[A \mid X]) + \mathrm{Var}(\mathbb{E}[Y \mid X])\big)^2 - 4 \lambda \langle a, \Sigma y \rangle^2}\right) + \big(\mathbb{E}[\mathrm{Var}(Y \mid X)] - \lambda\,\mathrm{Var}(A)\big).
\end{aligned}$$

Theorem 5.1. The optimal value of optimization problem (10) is upper bounded by $\langle y_{\perp a}, \Sigma\, y_{\perp a} \rangle$, where $y_{\perp a} = (I - a_0 a_0^T)\, y$ and $a_0 = a / \|a\|$.


To complete the proof of Theorem 5.1, realize that the vector $y_{\perp a}$ can be constructed explicitly as
$$y_{\perp a} = (I - a_0 a_0^T)\, y,$$
where $a_0 = a / \|a\|$ is the unit vector along $a$. To show when the equality is attained, let $V^*$ be the optimal solution of the relaxed problem above, which can be constructed by first eigendecomposing $\Sigma$ and then setting to $0$ all the eigenvalues of $\Sigma$ whose corresponding eigenvectors are not orthogonal to $a$. It is worth pointing out that $V^*$ is positive semidefinite but not necessarily invertible. Nevertheless, we can still define the projection matrix onto the column space of $V^*$ as follows:
$$P_{V^*} := V^* (V^{*T} V^*)^{\dagger} V^{*T},$$
where $Q^{\dagger}$ denotes the Moore–Penrose pseudoinverse of a matrix $Q$. With $P_{V^*}$, it is easy to verify that the optimal transformation is given by any $Z$ such that $\mathbb{E}[\phi(X) \mid Z] = P_{V^*} \phi(X)$. To see this, we have:
$$\mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z]) = \mathrm{Var}(P_{V^*} \phi(X)) = P_{V^*} \Sigma P_{V^*}^T = V^*,$$
completing the proof.

Next, we prove Theorem 5.2, restated below.

Theorem 5.2. The optimal value of optimization problem (12) is lower bounded by $\mathrm{Var}(Y) \cdot \langle a, y \rangle^2 / \langle y, y \rangle^2$.

The following theorem is the generalized version of Theorem 5.2 in the noisy setting:

Theorem C.2. The optimal value of optimization problem (12) is lower bounded by $\mathrm{Var}(\mathbb{E}[Y \mid X]) \cdot \langle a, y \rangle^2 / \langle y, y \rangle^2$.

It is easy to see that Theorem 5.2 is an immediate corollary of this result: under the noiseless assumption, we have $\mathrm{Var}(\mathbb{E}[Y \mid X]) = \mathrm{Var}(Y)$.

Proof. Due to the symmetry between $Y$ and $A$, here we only prove the first part of the theorem. As in the proof of Theorem 5.1, the same identities expressing $\mathrm{Var}\,\mathbb{E}[Y \mid Z]$ and $\mathrm{Var}\,\mathbb{E}[A \mid Z]$ in terms of $V = \mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z])$ hold, so that we can relax the optimization problem as follows:
$$\min_Z\; \langle a, V a \rangle, \quad \text{subject to } 0 \preceq V \preceq \Sigma,\; \langle y, V y \rangle = \langle y, \Sigma y \rangle.$$

C.4 EXPLICIT FORMULA FOR EIGENVALUES

The following lemma is used in the last step of the proof of Theorem 5.3 to simplify the expression involving $\sigma_d(R)$, where $R = \Sigma^{1/2}(\lambda a a^T - y y^T)\Sigma^{1/2}$. We can write $\mathrm{tr}(R)$ and $\mathrm{tr}(R^2)$ explicitly:
$$\mathrm{tr}(R) = \lambda \langle a, \Sigma a \rangle - \langle y, \Sigma y \rangle, \qquad \mathrm{tr}(R^2) = \lambda^2 \langle a, \Sigma a \rangle^2 - 2 \lambda \langle a, \Sigma y \rangle^2 + \langle y, \Sigma y \rangle^2.$$
Since $R$ has rank at most $2$, $\sigma_1(R)$ and $\sigma_d(R)$ are its only possibly nonzero eigenvalues, with sum $\sigma_1(R) + \sigma_d(R) = \mathrm{tr}(R)$ and product $\sigma_1(R)\,\sigma_d(R) = \frac{1}{2}\big(\mathrm{tr}(R)^2 - \mathrm{tr}(R^2)\big) = \lambda \langle a, \Sigma y \rangle^2 - \lambda \langle a, \Sigma a \rangle \langle y, \Sigma y \rangle$. Thus $\sigma_1(R)$ and $\sigma_d(R)$ are the roots of the quadratic equation:
$$x^2 - \big(\lambda \langle a, \Sigma a \rangle - \langle y, \Sigma y \rangle\big)\, x + \lambda \langle a, \Sigma y \rangle^2 - \lambda \langle a, \Sigma a \rangle \langle y, \Sigma y \rangle = 0.$$
We complete the proof by solving this quadratic equation.
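A quick numerical check of the trace identities and of the quadratic satisfied by $\sigma_1(R)$ and $\sigma_d(R)$; the random instance ($\Sigma$, $a$, $y$, $\lambda$) below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 5, 1.3                            # arbitrary dimension and weight
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)                # generic positive definite covariance
a, y = rng.standard_normal(d), rng.standard_normal(d)

# Symmetric square root of Sigma, then R = Sigma^{1/2}(lam aa^T - yy^T)Sigma^{1/2}.
w, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(w)) @ U.T
R = Sigma_half @ (lam * np.outer(a, a) - np.outer(y, y)) @ Sigma_half

aSa, ySy, aSy = a @ Sigma @ a, y @ Sigma @ y, a @ Sigma @ y

# Explicit traces from the lemma.
assert np.isclose(np.trace(R), lam * aSa - ySy)
assert np.isclose(np.trace(R @ R), lam**2 * aSa**2 - 2 * lam * aSy**2 + ySy**2)

# The extreme eigenvalues of the rank-2 matrix R are the roots of
#   x^2 - (lam<a,Sa> - <y,Sy>) x + lam<a,Sy>^2 - lam<a,Sa><y,Sy> = 0.
eigs = np.linalg.eigvalsh(R)               # ascending order
s1, sd = eigs[-1], eigs[0]
for x in (s1, sd):
    residual = x**2 - (lam * aSa - ySy) * x + lam * aSy**2 - lam * aSa * ySy
    assert np.isclose(residual, 0.0, atol=1e-6)
```

The remaining $d - 2$ eigenvalues of $R$ are zero, so `eigs[0]` and `eigs[-1]` are exactly $\sigma_d(R)$ and $\sigma_1(R)$.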

C.5 ACHIEVABILITY OF LOWER BOUND

In the proof of Theorem 5.3, we derived a lower bound on the tradeoff via an SDP relaxation; the lower bound is therefore achievable whenever the SDP relaxation is tight. We state this as a regularity condition on $(X, \phi)$: we say that $(X, \phi)$ is regular if for every matrix $M$ with $\Sigma \succeq M \succeq 0$ there exists a transformation $Z = g(X)$ such that $\mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z]) = M$; this is the property used in the proofs below. When $(X, \phi)$ is regular, the lower bound in Theorem 5.3 is achievable.

Proof. From the proof of Theorem 5.3, we can see that if there exists $Z = g(X)$ such that
$$\mathrm{Cov}(\mathbb{E}[\phi(X) \mid Z], \mathbb{E}[\phi(X) \mid Z]) = \Sigma^{1/2} w w^T \Sigma^{1/2},$$
where $w$ is the unit eigenvector of $R$ with eigenvalue $\sigma_d(R)$, then the equality is achievable. It is easy to see that $\Sigma \succeq \Sigma^{1/2} w w^T \Sigma^{1/2} \succeq 0$. Therefore, choosing $M = \Sigma^{1/2} w w^T \Sigma^{1/2}$ in the definition of regularity guarantees the existence of such a $Z$. Hence we have completed the proof.

A sufficient condition for the regularity of $(X, \phi)$ is Gaussianity of $\phi(X)$, in which case choosing $g(X)$ as a linear transformation is sufficient:

Theorem C.5. $(X, \phi)$ is regular if $\phi(X)$ follows a Gaussian distribution.

Proof. Note that when $\phi(X)$ is Gaussian, $(\phi(X), L\phi(X))$ is jointly Gaussian for any $L \in \mathbb{R}^{k \times d}$. Let $Z = L\phi(X)$; then the conditional distribution of $\phi(X)$ given $Z$ is Gaussian, and the conditional mean $\mathbb{E}[\phi(X) \mid Z]$ has covariance $\Sigma L^T (L \Sigma L^T)^{-1} L \Sigma$. We will prove that for any $\Sigma \succeq M \succeq 0$, there exists a linear transformation $L$ such that
$$M = \Sigma L^T (L \Sigma L^T)^{-1} L \Sigma.$$
Consider the eigendecomposition $\Sigma^{-1/2} M \Sigma^{-1/2} = U^T D U$, with $k = \mathrm{rank}(M)$, $U \in \mathbb{R}^{k \times d}$, and $D \in \mathbb{R}^{k \times k}$ invertible. Then, letting $L = D^{-1/2} U \Sigma^{-1/2}$, we have $L \Sigma L^T = D^{-1/2} U U^T D^{-1/2} = D^{-1}$, and hence
$$\Sigma L^T (L \Sigma L^T)^{-1} L \Sigma = \Sigma^{1/2} U^T D^{-1/2} \cdot D \cdot D^{-1/2} U \Sigma^{1/2} = \Sigma^{1/2} U^T D U \Sigma^{1/2} = M.$$
Therefore we have completed the proof. We conjecture that this regularity condition holds for more general distributions beyond the Gaussian.

