TIER BALANCING: TOWARDS DYNAMIC FAIRNESS OVER UNDERLYING CAUSAL FACTORS

Abstract

The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling of the decision-distribution interplay with a directed acyclic graph (DAG), we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes in the unobserved latent causal factors that directly carry the influence of the current decision into the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.

Published as a conference paper at ICLR 2023

The dynamic fairness-enforcing procedures proposed in prior work usually limit their scope of consideration to observed variables: the fairness audit is performed directly on the decision, or on statistics defined on observed data. In order to have a built-in capacity to capture the influence of the current decision on future data distributions, and, more importantly, to induce a fair future in the long run, in this paper we propose Tier Balancing, a long-term fairness notion that characterizes the interplay between decision-making and data dynamics through detailed causal modeling with a directed acyclic graph (DAG). For example, the latent socio-economic status (whose estimate can be the output of a FICO credit score model), although not directly measurable, plays an important role in credit applications.
We are motivated by the goal of inducing a fair future by actually balancing the inherent socio-economic status, i.e., the "tier", of agents from different groups. We summarize our contributions as follows:

• We formulate Tier Balancing, a fairness notion from the dynamic and long-term perspective that characterizes the decision-distribution interplay with detailed causal modeling over both observed variables and latent causal factors.

• Under the specified data dynamics, we prove that in general one cannot directly achieve the long-term fairness goal only through a one-step intervention, i.e., static decision-making.

• We consider the possibility of getting closer to the long-term fairness goal through a sequence of algorithmic interventions, and present possibility and impossibility results derived from a one-step analysis of the decision-distribution interplay.

2. PROBLEM FORMULATION

In this section, we present the formulation of the problem of interest. We first demonstrate in Section 2.1 a detailed causal modeling of the interplay between decision-making and data dynamics. Then in Section 2.2, we formulate Tier Balancing, a long-term fairness notion that captures the decision-distribution interplay under the presented causal modeling.

Let us denote the time step by T, with domain of value N+. At time step T = t, let us denote the protected feature by A_t with domain of value A = {0, 1}, additional feature(s) by X_{t,i} with domain of value X_i, the (unmeasured) underlying causal factor, which we call the "tier", by H_t with domain of value H = (0, 1], the (unobserved) ground truth label and the observed label by Y^(ori)_t and Y^(obs)_t, both with domain of value Y = {0, 1}, and the decision by D_t with domain of value D = {0, 1}. Figure 1 shows the causal modeling of the interplay between decision-making and the underlying data generating processes, which involves multiple dynamics (from T = t to T = t + 1).
Underlying data dynamics (stationary components). Considering the fact that the underlying data dynamics are relatively stable with respect to the timescale of decision-making (e.g., societal changes happen on a much larger time scale compared to a particular credit application decision), we assume that the processes governing how (Y^(ori)_t, X_{t,i}) are generated from (H_t, A_t) for each individual in the population are stationary and do not change over different T = t. We also assume that the underlying data generating process that governs how H_{t+1} is updated from (H_t, Y^(ori)_t, D_t) across time steps is stationary, and so are the process governing the observation of Y^(obs)_{t+1} given (D_t, Y^(ori)_{t+1}) and the process governing the update of A_{t+1} from A_t.

1. INTRODUCTION

The long-term fairness endeavor inevitably involves the interplay between decision policies and the underlying data generating process: when deriving a decision-making system, one usually makes use of the data at hand; when one deploys such a system, its decisions impact how data will look in the future (Perdomo et al., 2020; Liu et al., 2021). To understand why and how a data distribution responds to decision-making strategies, the investigation has to resort to causal modeling. The pursuit of long-term fairness, in turn, should also consider the changes in the underlying causal factors. Various fairness notions with different flavors have been proposed in the literature: associative fairness notions that capture the correlation or dependence between variables, e.g., Demographic Parity (Calders et al., 2009) and Equalized Odds (Hardt et al., 2016); and causal fairness notions that involve modeling causal relations between variables, e.g., Counterfactual Fairness (Kusner et al., 2017; Russell et al., 2017), Path-Specific Counterfactual Fairness (Chiappa, 2019; Wu et al., 2019), and Causal Multi-Level Fairness (Mhasawade & Chunara, 2021). These previously proposed fairness notions are defined with respect to a static snapshot of reality, and do not have a built-in capacity to model the decision-distribution interplay in the long-term fairness pursuit.
In the effort of enforcing fairness in the dynamic setting, researchers have approached the problem from different angles: they provide causal modeling for fairness notions (Creager et al., 2020), analyze the delayed impact or downstream effect on utilities (Liu et al., 2018; Heidari et al., 2019; Kannan et al., 2019; Nilforoshan et al., 2022), enforce fairness in sequential or online decision-making (Joseph et al., 2016; Liu et al., 2017; Hashimoto et al., 2018; Heidari & Krause, 2018; Bechavod et al., 2019), investigate the relation between long-term population qualification and fair decisions (Zhang et al., 2020), take into consideration user behavior/actions when deriving a decision policy (Zhang et al., 2019; Ustun et al., 2019; Miller et al., 2020; von Kügelgen et al., 2022), provide fairness transferability guarantees across domains (Schumann et al., 2019; Singh et al., 2021), or derive robust fair predictors (Coston et al., 2019; Rezaei et al., 2021).

The tier H_t fully captures the individual's key property that is directly relevant to the scenario of interest, and is therefore the cause of Y^(ori)_t and the X_{t,i}'s rather than the other way around. For example, an improvement in socio-economic status can be reflected through an increase in income, while manipulating one's income only by changing the recorded number does not affect the actual ability to repay the loan. This determination of causal direction aligns with causal modeling in previous literature (see, e.g., Zhang et al. 2020).
Decision-making dynamics (non-stationary components). The institution (decision maker) assigns decision D_t to each individual according to the observed features (A_t, X_{t,i}) and the outcome record Y^(obs)_t.

Figure 1: The causal modeling of the decision-distribution interplay. The circle (diamond) indicates that the corresponding variable (underlying factor) is unobservable.

The interpretation of the aforementioned variables depends on the problem at hand. For instance, in the credit application scenario where D_t denotes the application decision (approval or denial), we can interpret the latent tier H_t as the underlying socio-economic status of an individual. Since the decision-making strategy can vary across time steps, we explicitly introduce the underlying factor θ_t (e.g., a hyperparameter or an auxiliary variable) to indicate the (possibly) non-stationary nature of decision-making. The causal path from θ_t to θ_{t+1} indicates the similarity between decision-making strategies as time goes by (e.g., the continuing interest in utility), although the strategies themselves are not necessarily identical across time steps. We interpret the variable Y^(ori)_t as "whether or not one would repay the loan had he/she been approved for credit at T = t − 1 (which might not be the case in reality)". The variable Y^(obs)_t is observed only if the individual actually got approved at T = t − 1, i.e., D_{t−1} = 1. We distinguish between the underlying ground truth Y^(ori)_t and the observed record Y^(obs)_t because of their different roles in the decision-distribution interplay. On the one hand, only Y^(obs)_t is observed and therefore accessible to the decision maker (e.g., for training and evaluation).
On the other hand, since the potential outcome Y^(ori)_t reflects an individual's inherent characteristic regardless of whether it is observed, Y^(ori)_t is the variable used when the underlying data generating process specifies the update from H_t to H_{t+1}.

2.2. THE NOTION OF TIER BALANCING

Definition 2.1 (K-Step Tier Balancing). Under the specified dynamics, starting from any time step T and for a given K ≥ 0, let us denote a sequence of K decision-making strategies by D_{T:T+K−1} := {D_T, ..., D_{T+K−1}} (an empty set if K = 0), and the latest hidden tier after K-step decision-making by H_{T+K}. We say D_{T:T+K−1} satisfies K-Step Tier Balancing if at T + K the following condition holds true (where "⊥⊥" denotes statistical independence):

H_{T+K} ⊥⊥ A_{T+K}, where H_{T+K} is updated from (H_T, Y^(ori)_{T:T+K−1}, D_{T:T+K−1}).  (1)

Equation (1) captures the statistical consequence in the future (in the form of an associative relationship) induced by the interplay between the underlying data dynamics and the decision-making policies along the way. The causal modeling is essential in capturing our long-term fairness goal, since the attainment of Tier Balancing is an induced outcome of a sequence of K decision-making strategies D_{T:T+K−1} (which are indispensable, although the fairness notion itself is not explicitly defined on decisions). Our Tier Balancing notion of algorithmic fairness is distinguished from previously proposed fairness notions in several important ways. To begin with, Tier Balancing has a built-in dynamic flavor: its definition involves variables that span multiple time steps. Therefore the audit of Tier Balancing inevitably requires long-term and dynamic analysis, which is very different from previously proposed (both associative and causal) fairness notions defined with respect to a static snapshot of reality (e.g., Calders et al., 2009; Hardt et al., 2016; Kusner et al., 2017; Chiappa, 2019).
Besides, considering the fact that the decision-distribution interplay often involves situation changes in the hidden causal factors, Tier Balancing extends the scope of fairness consideration beyond observed variables to hidden causal factors, which makes our notion a technically more challenging but more natural long-term fairness goal to achieve. The endeavor to define fairness in terms of latent causal factors is not an unrealistic fantasy. Recent advances in the causal discovery literature have established identifiability results (under certain assumptions) on causal structures among latent variables (Xie et al., 2020; Adams et al., 2021; Kivva et al., 2021; Xie et al., 2022), which provide not only a theoretical justification, but also an indication of the potential, for our effort in exploring long-term fairness through modeling latent causal factors. Furthermore, although Tier Balancing is not directly defined in terms of the decisions themselves, it is characterized with a detailed causal modeling that involves both decision-making and data dynamics. The explicit causal modeling of the decision-distribution interplay offers both challenges and opportunities for more principled fairness inquiries in the long-term, dynamic context (Hu & Chen, 2018; Liu et al., 2018; Mouzannar et al., 2019; Heidari et al., 2019; Zhang et al., 2020). In Definition 2.1, we specify the step K at which Tier Balancing is evaluated. If K = 0, i.e., H_T ⊥⊥ A_T happens to hold initially (although in general it may not), Tier Balancing is attained at the beginning. When H_T is not independent of A_T and K ≥ 1, K-Step Tier Balancing is achieved with respect to the underlying causal factor H_{T+K}.
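Whether (approximate) Tier Balancing holds at a given step can be audited empirically whenever samples of the latent tier are available, e.g., in a simulation. The sketch below compares the group-conditional empirical distributions of H via a Kolmogorov–Smirnov-style maximum CDF gap; the function name, the sample values, and the idea of thresholding the gap are our own illustrative choices, not part of Definition 2.1.

```python
def max_cdf_gap(h_group0, h_group1):
    """Maximum gap between the two empirical CDFs of the tier H,
    evaluated at every observed tier value (a KS-style statistic).
    A value near 0 suggests H is (approximately) independent of A."""
    pts = sorted(set(h_group0) | set(h_group1))
    n0, n1 = len(h_group0), len(h_group1)
    gap = 0.0
    for p in pts:
        cdf0 = sum(h <= p for h in h_group0) / n0
        cdf1 = sum(h <= p for h in h_group1) / n1
        gap = max(gap, abs(cdf0 - cdf1))
    return gap

# Identically distributed tiers -> zero gap; disjoint tiers -> maximal gap.
balanced = max_cdf_gap([0.2, 0.4, 0.6, 0.8], [0.2, 0.4, 0.6, 0.8])
imbalanced = max_cdf_gap([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])
```

In practice one would apply a proper two-sample test with finite-sample corrections; this sketch only conveys the idea of auditing the independence condition distributionally rather than on a single statistic.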

3. CHARACTERIZING TIER BALANCING

In Section 2, we proposed a detailed causal modeling of the interplay between decision-making and the underlying data generating processes, based on which we formulated the novel long-term fairness notion of Tier Balancing. Our model is applicable to a wide range of resource allocation scenarios, e.g., hiring (Hu & Chen, 2018; Kannan et al., 2019), credit application (Liu et al., 2018), and predictive policing (Ensign et al., 2018; Elzayn et al., 2019). For clarity of discussion, in this section we consider a running example of credit application, where agents in a fixed population repeatedly apply for credit. We first demonstrate how one can apply the proposed causal modeling in the credit application scenario in Section 3.1. Then in Section 3.2, we characterize the Single-step Tier Imbalance Reduction (STIR) term for the purpose of conducting a one-step analysis of the Tier Balancing notion of long-term fairness.

3.1. MODELING DETAIL OF DECISION-DISTRIBUTION INTERPLAY

As shown in Figure 1, the unmeasured latent causal factor H_t (the hidden socio-economic status) is the actual root cause of the ground truth label Y^(ori)_t as well as the (possibly) observed label Y^(obs)_t (the repayment record). For any given tier H_t = h_t, let us assume that the unobserved ground truth Y^(ori)_t is sampled from a Bernoulli distribution with h_t as the success probability, and that the observed repayment record Y^(obs)_t depends on both the ground truth label Y^(ori)_t and the previously received decision D_{t−1}:

Y^(obs)_t = Y^(ori)_t if D_{t−1} = 1, and is undefined if D_{t−1} = 0, where Y^(ori)_t ~ Bernoulli(h_t).  (2)

From Equation (2) we can see that Y^(obs)_t is a masked copy of Y^(ori)_t (masked by D_{t−1}), and we have the following proposition capturing the property of the marginal distribution of Y^(obs)_t:

Proposition 3.1. At time step T = t, for any H_t = h_t ∈ (0, 1], under the specified dynamics, among the population where the ground truth is actually observable, i.e., Y^(obs)_t is not undefined, we have Y^(obs)_t ~ Bernoulli(h_t).

Proposition 3.1 captures the fact that among the population where the repayment record Y^(obs)_t is observed, the marginal distributions of Y^(obs)_t and Y^(ori)_t are identical. This property indicates that although one does not have access to the unobserved tier, i.e., the socio-economic status H, one can still use the observed Y^(obs) as a bridge to infer its behavior.

Fact 3.2. Let A be the domain of value for the protected feature A_t, and E the domain of value for all other exogenous noise terms E_t. For each time step T = t, we can write D_t, Y^(ori)_t, and H_t as functions of all root causes (including the protected feature A_t and the exogenous noise terms E_t) in the system, e.g., H_t = f_t(A_t, E_t), D_t = g^D_t(A_t, E_t), and Y^(ori)_t = g^{Y(ori)}_t(A_t, E_t), without explicitly specifying the respective functional forms, which may depend on further assumptions on the joint distribution and on the time step T = t.[4]
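The masking mechanism of Equation (2) and Proposition 3.1 can be illustrated with a minimal simulation: conditionally on H_t = h_t, the ground truth is Bernoulli(h_t) regardless of the previous decision, so restricting attention to the observed sub-population leaves the marginal distribution unchanged. The numbers below (the tier value, the masking rate) are illustrative, and for simplicity the previous decision is drawn independently, which matches the conditional setting "given H_t = h_t".

```python
import random

random.seed(0)
h = 0.7          # a fixed tier value H_t = h_t
n = 200_000      # individuals sharing this tier (illustrative)

observed = []
for _ in range(n):
    y_ori = 1 if random.random() < h else 0      # Y^(ori)_t ~ Bernoulli(h)
    d_prev = 1 if random.random() < 0.5 else 0   # previous decision D_{t-1}
    if d_prev == 1:                              # Y^(obs)_t defined only if D_{t-1} = 1
        observed.append(y_ori)

# Among the sub-population where Y^(obs)_t is defined, its empirical mean
# recovers h (Proposition 3.1): the mask does not bias the conditional law.
rate = sum(observed) / len(observed)
```

The same logic explains why Y^(obs) can serve as a "bridge" to the latent tier: its observed success rate is an unbiased handle on h_t within the approved sub-population.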
Fact 3.2 is a direct result of representing causal relations with a functional causal model (FCM) (Spirtes et al., 1993; Pearl, 2009), and is introduced for notational convenience in later analysis.[5]

Assumption 3.3 (Multiplicative update of the underlying tier). Let α_D, α_Y ∈ [0, 1/2) be the parameters that capture the influences from the current decision D_t and the ground truth Y^(ori)_t on the next step:

H_{t+1} = min{ 1, H_t · (1 + α_D (2 D_t − 1) + α_Y (2 Y^(ori)_t − 1)) }.  (3)

Assumption 3.3 states that H_{t+1} treats H_t as a baseline, increased or decreased in a multiplicative fashion based on the agent's received decision and repayment information. We are inspired by evolutionary theory, where multiplicative updates have been a common modeling choice to capture updates in relevant statistics (Friedman & Sinervo, 2016; Dawkins & Davis, 2017). The explicit dependence on the update parameters α_D and α_Y, related to D_t and Y^(ori)_t respectively, characterizes two important aspects of our model: the update of an individual's tier potentially depends on the received decision D_t, as well as on the ground truth Y^(ori)_t (even if unobserved).[6] The condition α_D, α_Y ∈ [0, 1/2) ensures that H_{t+1} > 0, and the min{·, 1} operation ensures that H_{t+1} is upper-capped by 1. In practical scenarios where agents repeatedly apply for a resource (e.g., in our running example of credit application) at each time step T = t, with the entire group remaining unchanged, we have:

Assumption 3.4. The protected feature at time step T = t + 1 is an identical copy of that at T = t:

∀ t : A_{t+1} = A_t.  (4)
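Assumption 3.3 translates directly into code. The sketch below implements Equation (3); the parameter values in the examples are illustrative choices of ours.

```python
def update_tier(h, d, y, alpha_d, alpha_y):
    """Multiplicative tier update of Assumption 3.3 (Equation 3):
    H_{t+1} = min(1, H_t * (1 + alpha_d*(2*D_t - 1) + alpha_y*(2*Y_t - 1))).
    With alpha_d, alpha_y in [0, 1/2), the multiplicative factor stays
    strictly positive, so H_{t+1} remains in (0, 1]."""
    assert 0 <= alpha_d < 0.5 and 0 <= alpha_y < 0.5
    factor = 1 + alpha_d * (2 * d - 1) + alpha_y * (2 * y - 1)
    return min(1.0, h * factor)

# Approved and repaid: tier rises; denied and would-not-repay: tier falls.
up = update_tier(0.5, d=1, y=1, alpha_d=0.1, alpha_y=0.1)      # 0.5 * 1.2 = 0.6
down = update_tier(0.5, d=0, y=0, alpha_d=0.1, alpha_y=0.1)    # 0.5 * 0.8 = 0.4
capped = update_tier(0.95, d=1, y=1, alpha_d=0.25, alpha_y=0.2)  # capped at 1.0
```

Note how the two guarantees discussed in the text (positivity from α_D, α_Y < 1/2, and the upper cap at 1) correspond to the assertion and the min(·, 1) in the code.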

3.2. ONE-STEP ANALYSIS TOWARDS TIER BALANCING

In Definition 2.1 we established the long-term fairness goal. Considering the dynamic nature of this fairness notion, apart from defining "what exactly is fair in the long run", we also need to clarify the meaning of "getting closer to the long-term fairness goal", i.e., approaching the long-term fairness goal through a sequence of algorithmic interventions. In this section, we present the one-step theoretical analysis framework and characterize the Single-step Tier Imbalance Reduction (STIR) term ∆STIR|_t^{t+1} for the purpose of investigating Tier Balancing.

3.2.1. SINGLE-STEP TIER IMBALANCE REDUCTION (STIR)

Recall that the long-term fairness objective is the independence between the protected feature A_T and the hidden tier H_T = f_T(A_T, E_T) (at a time step T). Equivalently, we would like H_T not to be a function of A_T. We can view f_T(0, E_T) and f_T(1, E_T) as two dependent random variables, and quantify the amount of "getting closer to the long-term fairness goal" by comparing the absolute difference between f_T(0, E_T) and f_T(1, E_T) before (T = t) and after (T = t + 1) a one-step update, and checking whether the gap decreases. Since the individual-level exogenous noise term E_T is the input, this comparison of absolute differences is at the individual level. Therefore, in order to quantitatively characterize the overall amount of "getting closer to the long-term fairness goal", we need to take into account the different possible combinations of decision D_t and outcome Y^(ori)_t (at T = t) for each individual, and aggregate the individual-level comparisons over the population.

[4] We implicitly adopt the assumption that the protected feature itself is not caused by other variables.
[5] In Appendix B.3, we discuss the role of the exogenous terms E_t in Fact 3.2.
[6] In Assumption 3.3 we explicitly specify that we are considering Y^(ori)_t rather than Y^(obs)_t. At T = t, every individual has a binary ground truth Y^(ori)_t. However, it might not be the case that everyone has an observable Y^(obs)_t (since Y^(obs)_t is undefined for an individual if his/her D_{t−1} = 0).

Given combinations of D_t and Y^(ori)_t, when E_t = ε, we denote the conditional joint probability density of (f_t(0, ε), f_t(1, ε)) by

q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′) := q_t(f_t(0, ε), f_t(1, ε) | g^D_t(0, ε) = d, g^D_t(1, ε) = d′, g^{Y(ori)}_t(0, ε) = y, g^{Y(ori)}_t(1, ε) = y′),

and calculate the Single-step Tier Imbalance Reduction (STIR) term from T = t to T = t + 1, denoted by ∆STIR|_t^{t+1}, as follows:
∆STIR|_t^{t+1} := E[ |f_{t+1}(0, E_{t+1}) − f_{t+1}(1, E_{t+1})| ] − E[ |f_t(0, E_t) − f_t(1, E_t)| ]
= Σ_{d,d′,y,y′ ∈ {0,1}} P_t(d, d′, y, y′) · ∫_{ε∈E} ∫_{ξ∈E} q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′) · (|φ_{t+1}(ξ)| − |φ_t(ε)|) · 1{φ_{t+1}(ξ) = G_t(f_t, g^D_t, g^{Y(ori)}_t; d, d′, y, y′, ε, α_D, α_Y)} dξ dε,  (5)

where φ_t(ε) := f_t(0, ε) − f_t(1, ε), G_t is a function whose value relies only on the information available at time step T = t, and P_t(d, d′, y, y′) is the joint distribution of (d, d′, y, y′):

P_t(d, d′, y, y′) := P_t(g^D_t(0, E_t) = d, g^D_t(1, E_t) = d′, g^{Y(ori)}_t(0, E_t) = y, g^{Y(ori)}_t(1, E_t) = y′).

We can then characterize "getting closer to the long-term fairness goal" via the inequality ∆STIR|_t^{t+1} < 0.
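When the structural functions are known (e.g., in a simulation study), the STIR term can be approximated by Monte Carlo sampling. The sketch below couples the two counterfactual trajectories through a shared exogenous noise draw and averages the change in the absolute tier gap after one step of the dynamics of Assumption 3.3. The structural function f, the demo policy, and all parameter values are hypothetical choices of ours, not quantities specified by the paper.

```python
import random

def estimate_stir(f, policy, alpha_d, alpha_y, n=50_000, seed=0):
    """Monte Carlo estimate of the STIR term:
    E|f_{t+1}(0,E) - f_{t+1}(1,E)| - E|f_t(0,E) - f_t(1,E)|.
    For each draw of the exogenous noise E_t we evaluate the tier under
    both values of A, roll the multiplicative update forward one step,
    and average the change in the absolute counterfactual tier gap."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.random()                       # exogenous noise E_t
        h = [f(0, eps), f(1, eps)]               # f_t(0,E_t), f_t(1,E_t)
        nxt = []
        for a in (0, 1):
            y = 1 if rng.random() < h[a] else 0  # Y^(ori)_t ~ Bernoulli(h)
            d = policy(a, h[a], y)               # decision D_t
            factor = 1 + alpha_d * (2 * d - 1) + alpha_y * (2 * y - 1)
            nxt.append(min(1.0, h[a] * factor))
        total += abs(nxt[0] - nxt[1]) - abs(h[0] - h[1])
    return total / n

# Hypothetical structural function: group a = 1 starts with higher tiers.
f = lambda a, eps: min(1.0, 0.3 + 0.3 * a + 0.3 * eps)
perfect = lambda a, h, y: y                      # perfect predictor D_t = Y_t
delta = estimate_stir(f, perfect, alpha_d=0.1, alpha_y=0.1)
```

A negative estimate would indicate "getting closer" to the long-term fairness goal; for the perfect predictor in this illustrative setup the estimate is positive, previewing the analysis of Section 4.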

3.2.2. SIMPLIFICATION ASSUMPTIONS

From Equation (5) we can see that the calculation of ∆STIR|_t^{t+1} requires knowledge of the gap comparison for each individual, |φ_{t+1}(ξ)| − |φ_t(ε)|; the conditional joint density before the one-step dynamics, q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′); and the joint probability of the different combinations of decision and ground truth before the one-step dynamics, P_t(d, d′, y, y′). In order to quantitatively analyze the behavior of ∆STIR|_t^{t+1}, it is essential that we have access to all three aforementioned quantities. To begin with, we need to know the instantiations of φ_{t+1}(ξ) given φ_t(ε) under the specified dynamics. Fortunately, as illustrated in Table 4, Table 5, and Table 6 of the Appendix, under the specified dynamics we can list all possible cases of the term φ_{t+1}(ξ) given φ_t(ε). Besides, we need additional knowledge of the conditional joint density q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′). For the purpose of exposition, we present two assumptions on the behavior of this conditional joint density, one qualitative and one quantitative:

Assumption 3.5 (Qualitative assumption). For any time step T = t and any exogenous term E_t = ε ∈ E, let us denote y = g^{Y(ori)}_t(0, ε) and y′ = g^{Y(ori)}_t(1, ε). The following inequalities hold:

P_t(f_t(0, ε) > f_t(1, ε) | y > y′) > P_t(f_t(0, ε) < f_t(1, ε) | y > y′),
P_t(f_t(0, ε) < f_t(1, ε) | y < y′) > P_t(f_t(0, ε) > f_t(1, ε) | y < y′).  (6)

Assumption 3.6 (Quantitative assumption). On top of Assumption 3.5, let us further assume that the conditional joint density q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′) satisfies the following condition:

q_t(f_t(0, ε), f_t(1, ε) | d, d′, y, y′) = γ^(up)_{dd′yy′} if f_t(0, ε) ≤ f_t(1, ε), and γ^(low)_{dd′yy′} if f_t(0, ε) > f_t(1, ε),  (7)

where γ^(low)_{dd′yy′} + γ^(up)_{dd′yy′} = 2, γ^(low)_{dd′yy′} < γ^(up)_{dd′yy′} when y < y′, and γ^(low)_{dd′yy′} > γ^(up)_{dd′yy′} when y > y′.
Assumption 3.5 is rather mild, stating that for any given exogenous noise term (of an individual) E_t = ε, whenever the ground truth Y^(ori)_t favors a certain demographic group, it is more likely that the underlying tier also favors the same group. Assumption 3.6 is a special case of Assumption 3.5, with quantitative characteristics built in for technical purposes. Lastly, we need to know the (behavior of the) joint probability P_t(d, d′, y, y′) for all combinations of (d, d′, y, y′). In fact, as we shall see in Section 4, when taking into consideration certain characteristics of the predictor, the joint probability P_t(d, d′, y, y′) follows patterns that simplify the analysis.

4. ONE-STEP ANALYSIS TOWARDS THE LONG-TERM FAIRNESS GOAL

In this section, we consider the possibility of attaining the long-term fairness goal with (a sequence of) one-step interventions. We first present a negative result that in general one cannot hope to achieve the long-term fairness goal through a single one-step intervention, i.e., static decision-making. In light of this result, we further investigate the possibility of getting closer to the long-term fairness goal with a sequence of one-step interventions, and present possibility and impossibility results accordingly.

4.1. THE GENERAL IMPOSSIBILITY OF ACHIEVING 1-STEP TIER BALANCING

In the following theorem, we prove that it is in general impossible to achieve H_{t+1} ⊥⊥ A_{t+1} when initially H_t is not independent of A_t, i.e., in general we cannot achieve Tier Balancing with a single one-step intervention.

Theorem 4.1. Let us consider the general situation where both D_t and Y^(ori)_t are dependent on A_t. Then under Fact 3.2, Assumption 3.3, and Assumption 3.4, as well as the specified dynamics, when H_t is not independent of A_t, we can possibly attain H_{t+1} ⊥⊥ A_{t+1} only if at least one of the following conditions holds true for all e_t ∈ E: (1) the ratio f_t(0, e_t)/f_t(1, e_t) takes values only in a specific restricted set; (2)–(5) conditions that each require a trivial decision-making policy.

Theorem 4.1 lists all possible ways to directly achieve the long-term fairness goal (under the specified conditions). As we can see, all of these conditions are rather restrictive. Condition (1) imposes strong requirements on the functional form of f_t(·). In particular, when f_t(0, e_t) and f_t(1, e_t) are both continuous random variables with non-zero density everywhere on the support (0, 1], the ratio is still a continuous random variable (because its density is an integral of positive products), and so the event specified in Condition (1) has measure zero. Conditions (2)–(5) all require trivial decision-making policies. Therefore, in general one cannot directly achieve the long-term fairness goal, and we need to consider the possibility of approaching the goal step by step.

4.2. POSSIBILITY OF GETTING CLOSER TO TIER BALANCING VIA ONE-STEP INTERVENTIONS

In Section 4.1, we saw that under the specified dynamics, a single one-step intervention is in general not enough to achieve long-term fairness. In this section, we investigate whether certain strategies can get closer to the long-term fairness goal through a sequence of algorithmic interventions. If we follow the same principle to derive the decision-making policy from the data at each time step, a one-step analysis suffices for the purpose of studying the interplay between decision and distribution.

4.2.1. ONE-STEP ANALYSIS ON PERFECT PREDICTOR

In this section, we consider the perfect predictor, whose predicted output equals the underlying ground truth, i.e., D_t = Y^(ori)_t. The output of the perfect predictor D_t is fully specified by the value of the ground truth Y^(ori)_t, and is therefore conditionally independent of the protected feature A_t given Y^(ori)_t. By the definition of Equalized Odds (Hardt et al., 2016), the perfect predictor is also the best possible Equalized Odds predictor (at time step t), with an accuracy of 100%. Furthermore, the joint probability P_t(d, d′, y, y′) for a perfect predictor is not positive for every combination of (d, d′, y, y′); we have the following sufficient condition for this joint probability to be zero:

d ≠ y or d′ ≠ y′ ⟹ P_t(d, d′, y, y′) = 0.  (8)

Theorem 4.2. Let us consider the general situation where both D_t and Y^(ori)_t are dependent on A_t. Under Fact 3.2, Assumption 3.3, Assumption 3.4, and Assumption 3.6, as well as the specified dynamics, when H_t is not independent of A_t, the perfect predictor does not have the potential to get closer to the long-term fairness goal after a one-step intervention, i.e.,

D_t = Y^(ori)_t ⟹ ∆STIR^(Perfect Predictor)|_t^{t+1} > 0.  (9)

Compared to Theorem 4.1, which shows that one in general cannot directly attain the long-term fairness goal (a balanced tier) through a one-step intervention, Theorem 4.2 provides a further insight: a perfectly accurate predictor does not merely fail to reach Tier Balancing, it moves the population away from it.
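The direction of Theorem 4.2 can be sanity-checked analytically. Ignoring the min(·, 1) cap, a perfect predictor (D_t = Y^(ori)_t) yields E[H_{t+1} | H_t = h] = h(1 + (α_D + α_Y)(2h − 1)), so tiers above 1/2 drift up while tiers below 1/2 drift down, widening the group gap. The representative tier values and parameters below are illustrative choices of ours.

```python
def expected_next_tier(h, alpha_d, alpha_y):
    """Expected H_{t+1} given H_t = h under a perfect predictor (D_t = Y_t).
    Since E[2Y - 1 | h] = 2h - 1, the multiplicative update of Assumption 3.3
    gives E[H_{t+1}] = h * (1 + (alpha_d + alpha_y) * (2h - 1)),
    ignoring the min(., 1) cap (inactive for the numbers used here)."""
    return h * (1 + (alpha_d + alpha_y) * (2 * h - 1))

h_low, h_high = 0.4, 0.7       # illustrative representative tiers per group
gap_before = h_high - h_low    # 0.3
gap_after = (expected_next_tier(h_high, 0.1, 0.1)
             - expected_next_tier(h_low, 0.1, 0.1))
# 0.7 * 1.08 - 0.4 * 0.96 = 0.756 - 0.384 = 0.372 > 0.3: the gap widens.
```

The calculation makes the "rich get richer" mechanism explicit: rewarding the (more often positive) ground truth of the advantaged group compounds multiplicatively, which is exactly why ∆STIR > 0 for the perfect predictor.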

4.2.2. ONE-STEP ANALYSIS ON COUNTERFACTUAL FAIR PREDICTOR

In this section, we consider the Counterfactual Fair (Kusner et al., 2017) predictor. As in the one-step analysis of perfect predictors, we make use of the characteristics of Counterfactual Fair predictors to simplify the quantitative analysis of the term ∆STIR|_t^{t+1}. The definition of Counterfactual Fairness requires the predictor to satisfy g^D_t(0, E_t) = g^D_t(1, E_t) within each time step T = t. Therefore, we have the following sufficient condition for the joint probability P_t(d, d′, y, y′) to be zero:

d ≠ d′ ⟹ P_t(d, d′, y, y′) = 0.  (10)

Theorem 4.3. Let us consider the general situation where both D_t and Y^(ori)_t are dependent on A_t. Let us further assume that the data dynamics satisfy α_D ∈ (0, 1/2) and α_Y = 0. Then under Fact 3.2, Assumption 3.3, Assumption 3.4, and Assumption 3.6, as well as the specified dynamics, when H_t is not independent of A_t, it is possible for the Counterfactual Fair predictor to get closer to the long-term fairness goal after a one-step intervention if certain properties of the data dynamics and of the predictor behavior are satisfied simultaneously. Writing r := (P_t(1,1,0,1) + P_t(1,1,1,0)) / (P_t(0,0,0,1) + P_t(0,0,1,0)),

g^D_t(0, E_t) = g^D_t(1, E_t), r < 27/8, α_D ∈ (r^{1/3} − 1, 1/2), α_Y = 0 ⟹ ∆STIR^(Counterfactual Fair)|_t^{t+1} < 0.  (11)

Theorem 4.3 demonstrates the possibility (not a guarantee) of getting closer to the long-term fairness goal through one-step interventions (under certain conditions) with Counterfactual Fair predictors. Compared to the general impossibility result for perfect predictors (Theorem 4.2), additional requirements (on both the data dynamics and P_t(d, d′, y, y′)) accompany the possibility result for Counterfactual Fair predictors.
That being said, Theorem 4.3 clearly illustrates that the understanding of data dynamics through a detailed causal modeling, combined with a suitable decision-making strategy, can provide us with a promising way to approach the long-term dynamic fairness goal (K-Step Tier Balancing with K > 1), step by step.
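Under our reading of the condition in Theorem 4.3, checking whether a given α_D admits the possibility result reduces to a simple interval test on the ratio of joint probabilities. The helper below is a hypothetical utility of ours (function name and example probabilities included), sketched under that reading rather than taken from the paper.

```python
def cf_possibility_interval(p_1101, p_1110, p_0001, p_0010):
    """Given the joint probabilities P_t(d, d', y, y') appearing in the
    condition of Theorem 4.3, return the open interval of alpha_D values
    for which the Counterfactual Fair predictor can reduce tier imbalance,
    or None when the ratio condition r < 27/8 fails.
    Note r < 27/8 implies r**(1/3) - 1 < 1/2, so the interval is nonempty."""
    r = (p_1101 + p_1110) / (p_0001 + p_0010)
    if r >= 27 / 8:
        return None
    lo = max(0.0, r ** (1 / 3) - 1)  # alpha_D must also be positive
    return (lo, 0.5)

interval = cf_possibility_interval(0.10, 0.05, 0.20, 0.10)  # r = 0.5 < 27/8
```

For r = 0.5 the lower endpoint r^{1/3} − 1 is negative, so any α_D ∈ (0, 1/2) qualifies; for larger ratios the admissible window shrinks and vanishes at r = 27/8.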

5. EXPERIMENTS

In this section, we present experimental results on both simulated data and the real-world FICO data set from the Board of Governors of the Federal Reserve System (US) (2007). In the sequence of decision-distribution interplay, the latent causal factor H_T is updated according to the specified dynamics (Equation 3) at each time step. The output of the decision policy (at each time step) depends on the specific scenario. In particular, we consider perfect predictors and Counterfactual Fair predictors (Kusner et al., 2017, Level 1 implementation).

5.1. DECISION WITH PERFECT PREDICTORS ON SIMULATED DATA

In Figure 2 we present the 20-step interplay between decisions and the underlying data generating process on the simulated data. The distributions of H_t for the different groups are initialized with truncated Gaussian distributions and Uniform distributions, respectively. At each time step T = t we generate ground truth labels Y^(ori)_t according to the data dynamics specified in Section 3.1 and set the decision D_t equal to the ground truth Y^(ori)_t (the perfect predictor as the decision-making policy); the pair (Y^(ori)_t, D_t) is then used by the data dynamics to determine the tier H_{t+1} for the next step. As we can see from the left-hand-side figures in panels (a) and (b), the gap between the tiers of the different groups is enlarged as time goes by. This indicates that intervening with perfect predictors does not get closer to the long-term fairness goal.
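The 20-step experiment of this subsection can be sketched in a few lines. The initial distributions and parameters below are illustrative stand-ins (uniform initial tiers rather than the paper's truncated Gaussians and uniforms), but the qualitative behavior, a widening tier gap under the perfect predictor, is the same phenomenon reported for Figure 2.

```python
import random

random.seed(1)
alpha_d, alpha_y = 0.1, 0.1   # illustrative dynamics parameters
n = 20_000                    # individuals per group

# Illustrative initial tier distributions for the two groups.
tiers = {0: [random.uniform(0.2, 0.6) for _ in range(n)],
         1: [random.uniform(0.5, 0.9) for _ in range(n)]}

def mean(xs):
    return sum(xs) / len(xs)

initial_gap = abs(mean(tiers[1]) - mean(tiers[0]))
for step in range(20):
    for a in (0, 1):
        for i, h in enumerate(tiers[a]):
            y = 1 if random.random() < h else 0   # ground truth Y^(ori)_t
            d = y                                  # perfect predictor D_t = Y_t
            factor = 1 + alpha_d * (2 * d - 1) + alpha_y * (2 * y - 1)
            tiers[a][i] = min(1.0, h * factor)     # Equation (3)
final_gap = abs(mean(tiers[1]) - mean(tiers[0]))
# Under the perfect predictor, the group tier gap widens over the 20 steps.
```

Swapping the decision rule `d = y` for another policy is all that is needed to replay the comparison with different predictors under the same dynamics.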

5.2. DECISION WITH COUNTERFACTUAL FAIR PREDICTORS ON CREDIT SCORE DATA

The FICO credit score data set contains 301,536 TransUnion credit score records from 2003 (Board of Governors of the Federal Reserve System (US), 2007). Following the preprocessing of Hardt et al. (2016), we convert the cumulative distribution function (CDF) of the TransRisk score among different groups into group-wise density distributions of the credit score, and use them as the initial tier distributions for the different groups. In Figure 3 we present the summary of a 20-step interplay between decisions made with Counterfactual Fair predictors and the underlying data generating process on the credit score data set. The Counterfactual Fair decision-making strategy is retrained after each one-step data dynamics. From Figure 3(a) we can observe that the gap between the step-by-step tracks of tiers for the different groups actually decreases before increasing again (around step 12). This indicates that deciding with Counterfactual Fair predictors does have the potential to get closer to the long-term fairness goal, if the data dynamics and the initial conditions (for each one-step analysis at different time steps) satisfy certain properties. Figure 3(a) also highlights the importance of our dynamic fairness analysis framework: if one does not model the interplay between decision and data dynamics, one may well deviate from the long-term fairness goal even after making some progress towards it.

6. CONCLUSION AND FUTURE WORK

In this paper, we propose Tier Balancing, a dynamic fairness notion that characterizes the decision-distribution interplay through a detailed causal model over both observed variables and underlying causal factors. We characterize Tier Balancing in terms of a one-step analysis framework based on Single-step Tier Imbalance Reduction (STIR). We show that in general one cannot directly achieve the long-term fairness goal only through a one-step intervention, i.e., static decision-making. We further show that under certain conditions it is possible (but not guaranteed) to get closer to the long-term fairness goal with (a sequence of) Counterfactual Fair decisions. Our results highlight the challenges and opportunities of enforcing a fairness notion that has a built-in capacity to model the decision-distribution interplay over underlying causal factors. Future work naturally includes developing algorithms to effectively and efficiently enforce the Tier Balancing notion of long-term, dynamic fairness in various practical scenarios.

ETHICS STATEMENT

The motivation of our work is the pursuit of long-term fairness. Our research was conducted with full awareness of, and adherence to, the ICLR Code of Ethics. We focus on the decision-distribution interplay and present a detailed causal modeling of both observed variables and latent causal factors. We present the challenges and opportunities offered by this new type of fairness endeavor, and hope our work can inspire further research to promote fairness in the long run.

SUPPLEMENT TO "TIER BALANCING: TOWARDS DYNAMIC FAIRNESS OVER UNDERLYING CAUSAL FACTORS"

Zeyu Tang 1, Yatong Chen 2, Yang Liu 2, and Kun Zhang 1,3
1 Department of Philosophy, Carnegie Mellon University
2 Computer Science and Engineering Department, University of California, Santa Cruz
3 Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence
zeyutang@cmu.edu, ychen592@ucsc.edu, yangliu@ucsc.edu, kunz1@cmu.edu

A DETAILED DISCUSSIONS ON RELATED WORKS

In this section, we provide detailed discussions on related works. In particular, considering our focus on providing a novel long-term fairness notion with the help of a detailed causal modeling of the involved dynamics, we compare our work with previous literature on causal notions of fairness, as well as fairness inquiries in dynamic settings.

A.1 CAUSAL NOTIONS OF FAIRNESS

Various causal notions of algorithmic fairness have been proposed in the literature, for instance, fairness notions defined in terms of the (non-)existence of certain causal paths in the graph (Kamiran et al., 2013; Kilbertus et al., 2017; Zhang et al., 2017), fairness notions defined through estimating or bounding causal effects (Kusner et al., 2017; Chiappa, 2019; Wu et al., 2019; Mhasawade & Chunara, 2021), and fairness notions defined with respect to statistics on certain factual/counterfactual groups (Imai & Jiang, 2020; Coston et al., 2020; Mishler et al., 2021). These causal notions audit fairness in an instantaneous manner, i.e., the fairness inquiries are with respect to a snapshot of reality, and the scope of consideration is limited to observed variables only. Our Tier Balancing notion has a built-in capacity to inquire fairness in the long-term, dynamic setting, which sets it apart from instantaneous causal fairness notions (beyond the fact that our notion encompasses latent causal factors). While one can detect and measure discrimination based on previous (instantaneous) causal notions of fairness (e.g., the existence of certain causal paths or causal effects), eliminating such causal paths or causal effects is a valid goal to achieve but might not be the means one should opt for. To begin with, there is no guarantee that eliminating a causal path or effect at present results in its non-existence in the future under the interplay between decision-making and data dynamics.
Furthermore, the data generating processes represented in the causal model might not be easily manipulable on the same timescale as decision-making (e.g., those governed by nature and/or the mode and structure of a society). One cannot expect that a manipulation of the causal model (for the purpose of enforcing fairness notions) directly translates into real-world changes in the underlying data generating processes. Different from previous causal fairness notions, instead of directly "going against" the underlying data generating process (e.g., by eliminating certain causal paths or causal effects), our Tier Balancing notion encourages "working with" the underlying data generating processes. With a detailed causal modeling of the decision-distribution interplay, Tier Balancing emphasizes the possibility of inducing a future data distribution that is fair in the long run.

A.2 FAIRNESS INQUIRIES IN DYNAMIC SETTINGS

Previous literature has considered dynamic fairness in specific practical scenarios, for instance, opportunity allocation in the labor market (Hu & Chen, 2018), a pipeline consisting of college admission followed by hiring (Kannan et al., 2019), opportunity allocation in credit application (Liu et al., 2018), and resource allocation in predictive policing (Ensign et al., 2018). Different from previous literature, we present a detailed causal modeling of the decision-distribution interplay that is general enough to be applicable to various resource allocation problems (e.g., loan applications, hiring practices) while also being specific enough to encompass nuances in the data dynamics of the particular practical scenario of interest. In terms of the analyzing framework, closely related works have considered the one-step analysis (Liu et al., 2018; Kannan et al., 2019; Mouzannar et al., 2019; Zhang et al., 2019). However, previous works focus on the long-term effect of imposing certain fairness notions that are readily available, for example, Demographic Parity (Calders et al., 2009; Liu et al., 2018; Mouzannar et al., 2019) and Equal Opportunity (Hardt et al., 2016; Liu et al., 2018). In our work, we formulate a novel notion of long-term fairness, namely, Tier Balancing, and explore the possibility of providing a fairness notion that characterizes the dynamic nature of the decision-distribution interplay through detailed causal modeling on both observed variables and latent causal factors. In terms of the modeling choice for data dynamics, most closely related works model data dynamics using variants of Markov Decision Processes (MDPs) (Jabbari et al., 2017; Siddique et al., 2020; Zhang et al., 2020; D'Amour et al., 2020; Wen et al., 2021; Zimmer et al., 2021; Ge et al., 2021). For example, Zhang et al.
(2020) consider the partially observable Markov decision process (POMDP) model, and conduct evolution and equilibrium analyses with respect to the Demographic Parity and Equal Opportunity notions of fairness. The dynamics are modeled through transition matrices on group-level qualification rates. Compared to the modeling of transition matrices in MDPs, our model is more fine-grained and operates on the individual level, answering the call for "richer and more complex modelings [of involved dynamics]" in previous literature (Hu & Chen, 2018). Another closely related work is the one-step analysis of the impact of causal fairness notions on downstream utilities conducted by Nilforoshan et al.

B.1.1 ADDITIONAL MODELING DETAILS OF THE INVOLVED DYNAMICS

We use the X_{t,i}'s to represent three different patterns (rather than the actual count) of variables with respect to how observed features are caused by the protected feature A_t and the latent causal factor H_t. There are three types of observed features: (1) features that only have the latent causal factor H_t as the cause, e.g., X_{t,1}; (2) features that have both the latent causal factor H_t and the protected feature A_t as causes, e.g., X_{t,2}; and (3) features that only have the protected feature A_t as the cause, e.g., X_{t,3}. For conciseness, we omit features that are not relevant to the practical scenario of interest, i.e., variables that are not causally relevant to (H_t, A_t). One can replace the X_{t,i}'s with the actual number of additional features, together with the causal relations among them, in specific practical scenarios. At every time step T = t, the decision-making strategy D_t is trained on the joint distribution of (A_t, X_{t,i}, Y_t^(obs)). However, when making the decision, D_t only takes (A_t, X_{t,i}) as input. Since we are modeling causal relations in data generating processes, we only include a directed edge in the DAG if there is a causal relation between variables. Therefore, the data generating process of D_t does not involve an edge between Y_t^(obs) and D_t.
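The three feature patterns can be sketched with toy structural equations; the functional forms, coefficients, and noise levels below are hypothetical illustrations, not the ones assumed in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

A = rng.binomial(1, 0.5, n)               # protected feature A_t
H = rng.uniform(1e-3, 1.0, n)             # latent tier H_t
X1 = H + rng.normal(0, 0.1, n)            # pattern 1: caused by H_t only
X2 = H + 0.5 * A + rng.normal(0, 0.1, n)  # pattern 2: caused by both H_t and A_t
X3 = 1.0 * A + rng.normal(0, 0.1, n)      # pattern 3: caused by A_t only
Y = rng.binomial(1, H)                    # ground truth: Y_t^(ori) ~ Bernoulli(H_t)

# D_t is trained on (A, X1, X2, X3, Y^(obs)) but consumes only (A, X1, X2, X3)
# at decision time; there is no edge between Y_t^(obs) and D_t in the DAG.
features = np.column_stack([A, X1, X2, X3])
```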

B.1.2 THE PRACTICAL SCENARIOS OF INTEREST

As we can see from the previous literature (discussed in Appendix A.2), the modeling choices are closely related to the practical scenarios of interest, and therefore can be very different in terms of the modeling details of the involved dynamics in long-term, dynamic settings. Our causal modeling of repetitive resource application and allocation keeps track of individual-level situation changes, and enables an informative and principled analysis of the decision-distribution interplay in different practical scenarios. For example, in credit application (e.g., Liu et al. 2018), the agents are clients and the latent causal factor (tier) can be an individual's socio-economic status or creditworthiness; in predictive policing (e.g., Ensign et al. 2018), the agents are neighborhoods and the latent tier can be a neighborhood's safety rating; in the dual market pipeline (e.g., temporary labor markets followed by the permanent labor market, as considered in Hu & Chen 2018) or the admission-followed-by-hiring pipeline (e.g., Kannan et al. 2019), the agents are applicants who are subject to a sequence of decisions, and the latent tier can be the relevant qualification for the school program and the job. However, when the decision received by the individual is once in a lifetime (or at least very infrequent compared to the timescale of the decision-making), repeated application and allocation of resources may not be a suitable modeling choice. For example, college admission decisions are made on a yearly basis, but an individual does not repeatedly apply for college every single year (Mouzannar et al., 2019). In this case, if we focus on the decisions made by a specific college, it is more natural to study changes in the population in terms of group-level qualification profiles (Mouzannar et al., 2019).
As another example, in the context of health care (e.g., Mhasawade & Chunara 2021), when the resource takes the form of medical treatment for the purpose of improving health outcomes, not all treatments require regular doses, and therefore repeated allocation modeled at the individual level may not be an optimal choice. One can, for instance, resort to modeling at the level of subgroups as an alternative (Mhasawade & Chunara, 2021). Considering the difference in the semantics of fairness across practical scenarios, previous literature has pointed out that there is in general no one-size-fits-all solution for algorithmic fairness (e.g., Kearns & Roth 2019). By presenting a detailed causal modeling of the decision-distribution interplay, we do not intend to provide a general framework that encompasses long-term fairness considerations in all practical scenarios. Instead, we would like to demonstrate the opportunities and challenges, and hope our work can inspire further research.

B.2 WHEN TIER BALANCING IS INITIALLY SATISFIED

In the paper we have presented possibility and impossibility results on achieving, or getting closer to, the long-term fairness goal when Tier Balancing is not initially satisfied. It is natural to wonder what we should do if we find out that Tier Balancing happens to be satisfied during a fairness audit. In fact, as we shall see in Proposition B.1, if Tier Balancing is satisfied as the initial condition, then under the specified dynamics one can use a Demographic Parity (Calders et al., 2009) decision-making strategy to maintain the status of satisfying Tier Balancing. This indicates that when Tier Balancing is satisfied (as a lucky initial condition, or as a result of K-step interventions), we have at least one way to maintain the fair state of affairs. Under the specified dynamics, H_{t+1} is determined by (H_t, D_t, Y_t^(ori)), among which none is a function of A_t. Then H_{t+1} is not a function of A_t. Recall that in the specified dynamics, the same group of agents repetitively applies for credit with the entire group unchanged. According to Assumption 3.4, A_{t+1} is an identical copy of A_t. Therefore, H_{t+1} is not a function of A_{t+1}, i.e., H_{t+1} ⊥⊥ A_{t+1}.

Let us first present the definition of a functional causal model (Spirtes et al., 1993; Pearl, 2009):

Definition B.2 (Functional Causal Model). We can represent a causal model with a tuple (E, V, F) such that: (1) V is a set of observed variables involved in the system of interest; (2) E is a set of exogenous variables that we cannot directly observe but that contain the background information representing all other causes of V, jointly following a distribution P(E); (3) F is a set of functions (also known as structural equations) {f_1, f_2, ..., f_n}, where each f_i corresponds to one variable V_i ∈ V and is a mapping E ∪ V \ {V_i} → V_i. The triplet (E, V, F) is known as the functional causal model (FCM).

We can also capture causal relations among variables via a directed acyclic graph (DAG), where nodes (vertices) represent variables and edges represent functional relations between variables and the corresponding direct causes (i.e., observed parents and unobserved exogenous terms). For the purpose of illustration, in Figure 4 we present the DAG (at time step T = t) with the exogenous terms E_t explicitly modeled, where E_t is the concatenation of individual exogenous terms:

E_t = (ε_{H_t}, ε_{Y_t^(obs)}, ε_{Y_t^(ori)}, ε_{X_{t,i}}, ε_{D_t}).    (B.1)

(Figure 4: the DAG at time step T = t with exogenous terms explicitly modeled, together with the neighboring time steps T = t − 1 and T = t + 1.)

We use the • symbol on certain exogenous noise terms, e.g., ε_{H_t} and ε_{Y_t^(obs)}, to denote the fact that the corresponding variables are affected by the previous time step (T = t − 1), and such influences are encapsulated into exogenous terms from the standpoint of the current time step (T = t). For example, the influence from the randomness in D_{t−1} (when T = t − 1) on the current Y_t^(obs) is encapsulated into an exogenous term ε_{Y_t^(obs)} when T = t. As we can see from Figure 4, the noise terms E_t are the unobserved exogenous terms that signify the unique characteristics of an individual. The utilization of such individual uniqueness can be found in the estimation of counterfactual causal effects by making use of the posterior distribution of exogenous noise terms conditioning on the observed features, e.g., Kusner et al. (2017).

B.4 DETAILED DERIVATION OF ∆STIR|_t^{t+1} (SECTION 3.2.1)

In this section, we provide the derivation detail of the Single-step Tier Imbalance Reduction (STIR) term:

∆STIR|_t^{t+1} := E[ |f_{t+1}(0, E_{t+1}) − f_{t+1}(1, E_{t+1})| ] − E[ |f_t(0, E_t) − f_t(1, E_t)| ].    (B.2)
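As a numeric illustration, once the tier functions are fixed, the STIR term can be estimated by Monte Carlo over the exogenous terms; the functional forms of f_t and f_{t+1} below are hypothetical stand-ins, chosen so that the one-step dynamics exactly halve the group gap:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
E_t = rng.uniform(0, 1, n)  # sampled exogenous terms at T = t

# Hypothetical tier functions f_t(a, e) and f_{t+1}(a, e); the group-dependent
# offset shrinks from 0.2 to 0.1 after one step (a made-up assumption).
f_t  = lambda a, e: np.clip(0.3 + 0.4 * e + 0.2 * (1 - a), 1e-3, 1.0)
f_t1 = lambda a, e: np.clip(0.3 + 0.4 * e + 0.1 * (1 - a), 1e-3, 1.0)

gap_t  = np.abs(f_t(0, E_t)  - f_t(1, E_t)).mean()   # E|f_t(0,E) - f_t(1,E)|
gap_t1 = np.abs(f_t1(0, E_t) - f_t1(1, E_t)).mean()  # same at T = t + 1
stir = gap_t1 - gap_t  # negative value => tier imbalance reduced
```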
To characterize the joint behavior of f_t(0, ·) and f_t(1, ·), define the conditional joint density

q_t( f_t(0, ϵ), f_t(1, ϵ) | d, d′, y, y′ )
    := q_t( f_t(0, ϵ), f_t(1, ϵ) | g_t^D(0, ϵ) = d, g_t^D(1, ϵ) = d′, g_t^{Y(ori)}(0, ϵ) = y, g_t^{Y(ori)}(1, ϵ) = y′ )
    = ∫_{ξ∈E} 1{ f_t(0, ξ) = f_t(0, ϵ), f_t(1, ξ) = f_t(1, ϵ) } · p_t( E_t = ξ | g_t^D(0, ϵ) = d, g_t^D(1, ϵ) = d′, g_t^{Y(ori)}(0, ϵ) = y, g_t^{Y(ori)}(1, ϵ) = y′ ) dξ,    (B.3)

where 1{·} is the indicator function, and the subscript t of the conditional probability densities (e.g., q_t(·) and p_t(·)) indicates that they (might) change over time with the time step T = t. The functional form of f_t can be convoluted, and it is not necessarily the case that f_t(0, ·) and f_t(1, ·) are injective mappings E → (0, 1]. Therefore, for the sake of generality, in Equation B.3 we explicitly introduce the indicator function 1{ f_t(0, ξ) = f_t(0, ϵ), f_t(1, ξ) = f_t(1, ϵ) } when characterizing the conditional joint density q_t.

B.4.2 CAPTURING SITUATION CHANGES FOR AN INDIVIDUAL

For a specific individual (j), given the value of the individual's exogenous terms E_t^(j) = e_t^(j), let us denote the difference between f_t(0, e_t^(j)) and f_t(1, e_t^(j)) as φ_t(e_t^(j)) := f_t(0, e_t^(j)) − f_t(1, e_t^(j)), and the sum of f_t(0, e_t^(j)) and f_t(1, e_t^(j)) as η_t(e_t^(j)) := f_t(0, e_t^(j)) + f_t(1, e_t^(j)). We introduce φ_t(·) and η_t(·) for conciseness of notation, and we can always map φ_t(·), η_t(·) back to f_t(0, ·), f_t(1, ·) via a coordinate transformation:

\[ \begin{pmatrix} f_t(0, e^{(j)}_t) \\ f_t(1, e^{(j)}_t) \end{pmatrix} = \frac{\sqrt{2}}{2} \begin{pmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{pmatrix} \begin{pmatrix} \varphi_t(e^{(j)}_t) \\ \eta_t(e^{(j)}_t) \end{pmatrix}. \tag{B.4} \]

Let us consider the connection between φ_{t+1}(e_{t+1}^(j)) = f_{t+1}(0, e_{t+1}^(j)) − f_{t+1}(1, e_{t+1}^(j)) at time step T = t + 1 and φ_t(e_t^(j)) = f_t(0, e_t^(j)) − f_t(1, e_t^(j)) at time step T = t. We use different time step subscripts for the exogenous terms, e.g., e_t, even though we are focusing on the same individual from T = t to T = t + 1. Nevertheless, for the given functional forms of f_t, g_t^D, g_t^{Y(ori)}, the combination (d^(j), d′^(j), y^(j), y′^(j)), the value of the exogenous term e_t^(j) in the initial situation of the current one-step analysis (when T = t), and the hyperparameters (α_D, α_Y), we can uniquely derive the value of φ_{t+1}(e_{t+1}^(j)) = f_{t+1}(0, e_{t+1}^(j)) − f_{t+1}(1, e_{t+1}^(j)), and list all possible instantiations of φ_{t+1}(e_{t+1}^(j)) in Table 4 (if α_D > α_Y), Table 5 (if α_D < α_Y), and Table 6 (if α_D = α_Y). Let us denote such a mapping from φ_t(e_t^(j)) to φ_{t+1}(e_{t+1}^(j)) by the function G_t. For the purpose of simplifying notations, we can omit the superscript (j) when there is no ambiguity, since the value of the exogenous terms e_T signifies the unique characteristics of an individual:

φ_{t+1}(e_{t+1}) := f_{t+1}(0, e_{t+1}) − f_{t+1}(1, e_{t+1}) = G_t(f_t, g_t^D, g_t^{Y(ori)}; d, d′, y, y′, e_t, α_D, α_Y).    (B.5)

Notice that the value of the function G_t only relies on information available at time step T = t.
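The coordinate transformation in Equation B.4 can be checked numerically; the values chosen for f_t(0, e) and f_t(1, e) below are arbitrary placeholders:

```python
import numpy as np

# Numeric check of Equation B.4: phi = f0 - f1 and eta = f0 + f1 map back to
# (f0, f1) via the stated rotation-style transformation.
f0, f1 = 0.7, 0.4  # hypothetical values of f_t(0, e) and f_t(1, e)
phi, eta = f0 - f1, f0 + f1

R = (np.sqrt(2) / 2) * np.array([
    [np.cos(np.pi / 4),  np.sin(np.pi / 4)],
    [-np.sin(np.pi / 4), np.cos(np.pi / 4)],
])
f_recovered = R @ np.array([phi, eta])  # recovers [f0, f1]
```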

B.4.3 AGGREGATING INDIVIDUAL-LEVEL SITUATION CHANGES

We can calculate the Single-step Tier Imbalance Reduction, i.e., the term ∆STIR|_t^{t+1}, as follows:

∆STIR|_t^{t+1} := E[ |f_{t+1}(0, E_{t+1}) − f_{t+1}(1, E_{t+1})| ] − E[ |f_t(0, E_t) − f_t(1, E_t)| ]
    (i) = E[ |φ_{t+1}(E_{t+1})| − |φ_t(E_t)| ]
    (ii) = E[ E[ |φ_{t+1}(ξ)| − |φ_t(ϵ)| | E_{t+1} = ξ, E_t = ϵ, φ_{t+1}(ξ) = G_t(f_t, g_t^D, g_t^{Y(ori)}; d, d′, y, y′, ϵ, α_D, α_Y) ] ]
    (iii) = Σ_{d,d′,y,y′ ∈ {0,1}} P_t(d, d′, y, y′) · ∫_{ϵ∈E} ∫_{ξ∈E} q_t( f_t(0, ϵ), f_t(1, ϵ) | d, d′, y, y′ ) · ( |φ_{t+1}(ξ)| − |φ_t(ϵ)| ) · 1{ φ_{t+1}(ξ) = G_t(f_t, g_t^D, g_t^{Y(ori)}; d, d′, y, y′, ϵ, α_D, α_Y) } dξ dϵ,    (B.6)

where in (ii), the condition E_{t+1} = ξ denotes that the exogenous terms of an individual take value ξ at T = t + 1, the condition E_t = ϵ denotes that they take value ϵ at T = t, and the condition φ_{t+1}(ξ) = G_t(f_t, g_t^D, g_t^{Y(ori)}; d, d′, y, y′, ϵ, α_D, α_Y) makes sure that we are keeping track of the same individual, in the sense that φ_{t+1}(ξ) at T = t + 1 is indeed a valid instantiation from φ_t(ϵ) at T = t (if φ_{t+1}(·) is not a valid instantiation from φ_t(·), the contribution to the expectation is 0). The equality (i) is based on the definitions of φ_t(·) and φ_{t+1}(·), and in (iii) the indicator 1{ φ_{t+1}(ξ) = G_t(·) } makes sure that we are keeping track of the same individual (whose exogenous noise term equals ϵ at time t) before and after the one-step dynamics, even if his/her exogenous noise term equals ξ at time t + 1, and ϵ might not equal ξ. The fact that we are keeping track of the same individual also justifies the practice of only integrating over (conditional) densities with subscript t, e.g., q_t(·) and P_t(·), instead of both t and t + 1. To see this from a different angle, keeping track of the situation changes of each individual (when comparing φ_{t+1}(e_{t+1}) with φ_t(e_t)) also relieves us of the trouble of estimating (conditional) densities that involve future information.
At the time step t, we do not know the densities q_{t+1}( f_{t+1}(0, E_{t+1}), f_{t+1}(1, E_{t+1}) | d, d′, y, y′ ) that involve future information.

B.5 ILLUSTRATIONS OF THE CONNECTION BETWEEN ASSUMPTION 3.5 AND ASSUMPTION 3.6

In this subsection, we provide further illustrations of the connection between Assumption 3.5 and Assumption 3.6. In Figure 5 we present the one-step update of the conditional joint distribution q_T( f_T(0, e_T), f_T(1, e_T) | d, d′, y, y′ ) from T = t to T = t + 1 (we present the case y < y′ as an example). In panels (a) and (b), the joint distribution of (f_T(0, E_T), f_T(1, E_T)) is plotted before and after the one-step dynamics, under the quantitative and the qualitative assumption, respectively. The distributions are color-coded: the deeper the color, the larger the value of the joint density. Compared to the qualitative assumption (Assumption 3.5, illustrated in Figure 5b), the quantitative assumption (Assumption 3.6, illustrated in Figure 5a) is just a special case, with quantitative characteristics built in for technical purposes (we will make use of Assumption 3.6 in the proofs of Theorem 4.2 and Theorem 4.3 in Appendix C). From the illustrations in Figure 5, we can also see that the behaviors of the one-step update of the conditional joint density under the qualitative and quantitative assumptions are similar, with deeper color patterns occurring in the upper-left corner, indicating similar changes in the corresponding conditional joint densities q_T( f_T(0, e_T), f_T(1, e_T) | d, d′, y, y′ ) (when y < y′, from T = t to T = t + 1).

B.6 ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present additional experimental results on the preprocessed FICO credit score data set (Board of Governors of the Federal Reserve System (US), 2007; Hardt et al., 2016). Similar to the experiment summarized in Figure 3, we convert the cumulative distribution function (CDF) of the group-wise TransRisk scores into group-wise density distributions of the credit score, and use them as the initial tier distributions for different groups. We consider utility-maximizing decision-making strategies, i.e., the decision-making policy is accuracy-oriented and there is no explicit fairness consideration. In Figure 6 we present the summary of a 20-step interplay between decisions made with accuracy-oriented predictors and the underlying data generating process on the credit score data set. The accuracy-oriented decision-making strategy is retrained after each one-step update of the data dynamics. From Figure 6 (a), there is no obvious evidence that the gap between the step-by-step tracks of tiers for different groups is decreasing over time. This observation aligns with our theoretical analysis (Theorem 4.2) and the simulation results (Figure 2) for perfect predictors.
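An accuracy-oriented decision rule of this kind can be sketched as a retrained threshold classifier with no fairness constraint; the feature model, noise level, and grid search below are hypothetical illustrations rather than the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
H = rng.uniform(1e-3, 1.0, n)   # latent tier
X = H + rng.normal(0, 0.1, n)   # observed proxy feature for the tier
Y = rng.binomial(1, H)          # ground-truth label

# Grid-search the threshold that maximizes training accuracy (no fairness term);
# this step would be re-run after each one-step update of the data dynamics.
thresholds = np.linspace(X.min(), X.max(), 200)
accs = [((X >= th).astype(int) == Y).mean() for th in thresholds]
best_th = thresholds[int(np.argmax(accs))]
D = (X >= best_th).astype(int)  # the accuracy-oriented decision at this step
```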

B.7 POTENTIAL LIMITATIONS OF OUR WORK

In this subsection, we discuss potential limitations of our work.

B.7.1 SPECIFIED DYNAMICS VS. CAUSAL DISCOVERY

In this paper, we present a detailed causal modeling of the decision-distribution interplay on a DAG (Section 2.1) and formulate the dynamic fairness notion, Tier Balancing, which captures the long-term fairness goal over the underlying causal factor. The research area of causal discovery, where the goal is to discover the causal relations among variables (Spirtes et al., 1993; Shimizu et al., 2006; Zhang & Hyvärinen, 2009; Zhang et al., 2011), is highly relevant but out of the scope of our paper. Our Tier Balancing notion of dynamic fairness, as well as our analyzing framework, does not rely on a causal model derived from causal discovery. As we discussed in the comparison with previous literature on dynamic fairness (Appendix A), our causal model is richer and more complex, which provides the potential for more principled reasoning about the essential decision-distribution interplay in the pursuit of long-term fairness. We acknowledge that it would be nice to have the ability to discover the underlying causal model of the involved dynamics, which would provide further refinements of our dynamic modeling based on the specific practical scenario of interest. Causal discovery can act as the icing on the cake, but it is not a necessary component of our analysis.

B.7.2 THE NUMBER AND DIMENSION OF LATENT CAUSAL FACTORS

In the causal modeling of the decision-distribution interplay presented in the paper, we consider one latent causal factor that carries the influence of the current decision to future distributions. Recent developments in the identification of causal structures that involve (more than one) latent factors (Xie et al., 2020; Adams et al., 2021; Kivva et al., 2021; Xie et al., 2022) provide not only a theoretical justification, but also an indication of the potential, for our effort in exploring long-term fairness inquiries over latent causal factors. We believe that our detailed causal modeling of the decision-distribution interplay (on both observed variables and latent causal factors) and our formulation of the Tier Balancing notion of long-term fairness act as an important first step.

C PROOF OF RESULTS

In this section, we provide proofs for the results presented in the paper. For better readability, we provide an additional Proof (sketch) before proving Theorem 4.1 (proof in Appendix C.2), Theorem 4.2 (proof in Appendix C.3), and Theorem 4.3 (proof in Appendix C.4), respectively.

C.1 PROOF FOR PROPOSITION 3.1

Proposition. At time step T = t, for any H_t = h_t ∈ (0, 1], under the specified dynamics, among the population where the ground truth is actually observable, i.e., Y_t^(obs) is not undefined, we have Y_t^(obs) ∼ Bernoulli(h_t).

Proof. To begin with, according to the d-separation relations among D_{t−1}, H_t, and Y_t^(ori) in Figure 1, we notice that Y_t^(ori) ⊥⊥ D_{t−1} | H_t. Therefore we have:

Y_t^(ori) ∼ Bernoulli(h_t),    P(Y_t^(ori) = 1 | H_t = h_t) = h_t,    P(D_{t−1} = 1 | H_t = h_t) = d(h_t),

where d(·) is a function d : (0, 1] → [0, 1]. Notice that there is no claim that D_{t−1} can be uniquely determined by a function of h_t alone. We only represent the conditional probability mass P(D_{t−1} = 1 | H_t = h_t) as a function of h_t without specifying the exact functional form. In fact, as we shall see in the later part of this proof, the exact functional form of d(·) does not affect the validity of the result. Since Y_t^(obs) is in fact Y_t^(ori) masked by D_{t−1}, i.e., Y_t^(obs) is observable only when D_{t−1} = 1 and is undefined when D_{t−1} = 0, we have:

P(D_{t−1} = 0, Y_t^(obs) = 0 | H_t = h_t) = P(D_{t−1} = 0, Y_t^(obs) = 1 | H_t = h_t) = 0.

This indicates that among the population where Y_t^(obs) is not undefined (the population itself may change at different time steps), for all y ∈ {0, 1}:

P(Y_t^(obs) = y | H_t = h_t) = P(D_{t−1} = 0, Y_t^(obs) = y | H_t = h_t) + P(D_{t−1} = 1, Y_t^(obs) = y | H_t = h_t) = P(D_{t−1} = 1, Y_t^(obs) = y | H_t = h_t) = P(D_{t−1} = 1, Y_t^(ori) = y | H_t = h_t).
Then, when d(h_t) ∈ (0, 1), we can calculate the following probability:

P(Y_t^(obs) = 1 | H_t = h_t)
    = P(Y_t^(obs) = 1 | H_t = h_t) / [ P(Y_t^(obs) = 1 | H_t = h_t) + P(Y_t^(obs) = 0 | H_t = h_t) ]
    = P(D_{t−1} = 1, Y_t^(ori) = 1 | H_t = h_t) / [ P(D_{t−1} = 1, Y_t^(ori) = 1 | H_t = h_t) + P(D_{t−1} = 1, Y_t^(ori) = 0 | H_t = h_t) ]
    = d(h_t) h_t / [ d(h_t) h_t + d(h_t)(1 − h_t) ]
    = h_t
    = P(Y_t^(ori) = 1 | H_t = h_t).

When d(h_t) = 1, if H_t = h_t, we know for sure that this individual received a positive decision in the previous time step (when T = t − 1), and we have Y_t^(ori) = Y_t^(obs) by definition. When d(h_t) = 0, if H_t = h_t, we know for sure that this individual did not receive a positive decision in the previous time step (when T = t − 1), and in this case Y_t^(obs) is undefined. Therefore, among the population where the ground truth is actually observable, i.e., Y_t^(obs) is not undefined, we have Y_t^(obs) ∼ Bernoulli(h_t).

C.2 PROOF FOR THEOREM 4.1

Theorem. Let us consider the general situation where both D_t and Y_t^(ori) are dependent on A_t, i.e., D_t ⊥̸⊥ A_t and Y_t^(ori) ⊥̸⊥ A_t.
Then, under Fact 3.2, Assumption 3.3, and Assumption 3.4, as well as the specified dynamics, when H_t ⊥̸⊥ A_t, only if at least one of the following conditions holds true for all e_t ∈ E can we possibly attain H_{t+1} ⊥⊥ A_{t+1}:

(1) The ratio f_t(0, e_t) / f_t(1, e_t) has a specific domain of values:

f_t(0, e_t) / f_t(1, e_t) = (1 ± α_D ± α_Y) / (1 ± α_D ± α_Y);

(2) Positive (negative) labels only appear in the advantaged (disadvantaged) group, and the decision for everyone is positive (if α_D > α_Y):

f_t(0, e_t) ∈ [1/(1 + α_D − α_Y), 1],    f_t(1, e_t) ∈ [1/(1 + α_D + α_Y), 1],
g_t^{Y(ori)}(0, e_t) = 0,    g_t^{Y(ori)}(1, e_t) = 1,    g_t^D(0, e_t) = g_t^D(1, e_t) = 1;

(3) Negative (positive) labels only appear in the advantaged (disadvantaged) group, and the decision for everyone is positive (if α_D > α_Y):

f_t(0, e_t) ∈ [1/(1 + α_D + α_Y), 1],    f_t(1, e_t) ∈ [1/(1 + α_D − α_Y), 1],
g_t^{Y(ori)}(0, e_t) = 1,    g_t^{Y(ori)}(1, e_t) = 0,    g_t^D(0, e_t) = g_t^D(1, e_t) = 1;

(4) Everyone has a positive label, but the positive decision is exclusive to the advantaged group (if α_D < α_Y):

f_t(0, e_t) ∈ [1/(1 − α_D + α_Y), 1],    f_t(1, e_t) ∈ [1/(1 + α_D + α_Y), 1],
g_t^{Y(ori)}(0, e_t) = g_t^{Y(ori)}(1, e_t) = 1,    g_t^D(0, e_t) = 0,    g_t^D(1, e_t) = 1;

(5) Everyone has a positive label, but the positive decision is exclusive to the disadvantaged group (if α_D < α_Y):

f_t(0, e_t) ∈ [1/(1 + α_D + α_Y), 1],    f_t(1, e_t) ∈ [1/(1 − α_D + α_Y), 1],
g_t^{Y(ori)}(0, e_t) = g_t^{Y(ori)}(1, e_t) = 1,    g_t^D(0, e_t) = 1,    g_t^D(1, e_t) = 0.

Proof (sketch). In order to see the exact condition under which it is possible to achieve H_{t+1} ⊥⊥ A_{t+1}, we consider the necessary and sufficient condition such that H_{t+1} = f_{t+1}(A_{t+1}, E_{t+1}) is not a function of A_{t+1}.
This, together with Fact 3.2, Assumption 3.3, and Assumption 3.4, indicates that we need to consider the condition under which

H_{t+1} = min{ 1, f_t(A_t, E_t) (1 + α_D(2D_t − 1) + α_Y(2Y_t^(ori) − 1)) }

is not a function of A_t. Since both D_t and Y_t^(ori) are binary, we can exhaustively consider all value combinations of D_t and Y_t^(ori), and list every possible value H_{t+1} can take in each case in Table 1 (if α_D > α_Y), Table 2 (if α_D < α_Y), or Table 3 (if α_D = α_Y). By exhaustively going through the possible cases, we can have a full picture of the update of H_{t+1} based on (H_t, Y_t^(ori), D_t), and then derive the conditions under which H_{t+1} is not a function of A_t, i.e., the conditions under which it is possible to attain H_{t+1} ⊥⊥ A_{t+1}.

Proof (full). In order to see the exact condition under which it is possible to achieve H_{t+1} ⊥⊥ A_{t+1}, we consider the necessary and sufficient condition such that H_{t+1} = f_{t+1}(A_{t+1}, E_{t+1}) is not a function of A_{t+1}. By Fact 3.2, Assumption 3.3, and Assumption 3.4, it is necessary and sufficient to consider the condition under which

H_{t+1} = min{ 1, f_t(A_t, E_t) (1 + α_D(2D_t − 1) + α_Y(2Y_t^(ori) − 1)) }

is not a function of A_t. Considering the fact that both D_t and Y_t^(ori) are binary, we can compare the values of H_{t+1} when A_t = 0 and A_t = 1 for all possible combinations of D_t and Y_t^(ori). For any fixed e_t ∈ E, we can list all the cases in Table 1 (if α_D > α_Y), Table 2 (if α_D < α_Y), or Table 3 (if α_D = α_Y), and see if, for all e_t ∈ E, there is no difference in the value of H_{t+1} between the cases A_t = 0 and A_t = 1.
From Table 1, Table 2, and Table 3, we can see that if and only if the following holds true can we achieve H_{t+1} ⊥⊥ A_{t+1}: for every e_t ∈ E, whenever the joint probability

P( g_t^D(0, e_t) = d, g_t^{Y(ori)}(0, e_t) = y, g_t^D(1, e_t) = d′, g_t^{Y(ori)}(1, e_t) = y′ )

is nonzero, the last two columns of the corresponding row(s) in the table, i.e., the exact values of H_{t+1}, need to match. For example, when α_D > α_Y, if we know P( g_t^D(0, e_t) = 0, g_t^{Y(ori)}(0, e_t) = 0, g_t^D(1, e_t) = 0, g_t^{Y(ori)}(1, e_t) = 0 ) ≠ 0, we need the last two columns of Case (i) of Table 1 to equal each other, i.e., we need f_t(0, e_t) = f_t(1, e_t) to hold true. Without further assumptions on the joint distribution of the data, we do not know which combinations of (d, y, d′, y′) will result in a nonzero joint probability P( g_t^D(0, e_t) = d, g_t^{Y(ori)}(0, e_t) = y, g_t^D(1, e_t) = d′, g_t^{Y(ori)}(1, e_t) = y′ ) ≠ 0. However, considering the fact that

Σ_{d, d′ ∈ D, y, y′ ∈ Y} P( g_t^D(0, e_t) = d, g_t^{Y(ori)}(0, e_t) = y, g_t^D(1, e_t) = d′, g_t^{Y(ori)}(1, e_t) = y′ ) = 1

holds for all e_t ∈ E, we do know that for any fixed e_t ∈ E, there is at least one possible instantiation of (d*, y*, d′*, y′*) such that:

P( g_t^D(0, e_t) = d*, g_t^{Y(ori)}(0, e_t) = y*, g_t^D(1, e_t) = d′*, g_t^{Y(ori)}(1, e_t) = y′* ) ≠ 0.    (C.1)

Let us first consider situations where α_D > α_Y and focus on Table 1. The analysis of the situations where α_D < α_Y (i.e., Table 2) or α_D = α_Y (i.e., Table 3) is of the same flavor, and we therefore omit the detail in the proof. To begin with, we can observe that not every entry of the last two columns explicitly keeps the min{·, 1} operator.
On the one hand, since $\alpha_D > \alpha_Y$ (with $\alpha_D, \alpha_Y \in [0, \frac{1}{2})$, as in Assumption 3.3), we have $(1 - \alpha_D \pm \alpha_Y) \in (0, 1)$ and $f_t(a_t, e_t)(1 - \alpha_D \pm \alpha_Y) \in (0, 1)$ (since $H_t = f_t(a_t, e_t) \in (0, 1]$); therefore, we do not need to keep the $\min\{\cdot, 1\}$ operator explicit, for instance, in the second-to-last column of Cases (v)-(viii). On the other hand, when the coefficient $(1 + \alpha_D \pm \alpha_Y) > 1$, we are not sure whether $f_t(a_t, e_t)(1 + \alpha_D \pm \alpha_Y)$ exceeds $1$; therefore, we need to keep the $\min\{\cdot, 1\}$ operator explicit, for instance, in the last column of Cases (v)-(viii). Besides, if only one entry of the last two columns explicitly has the $\min\{\cdot, 1\}$ operator, matching them is equivalent to requiring that the terms themselves (before applying the operator) are equal (since the entry without the $\min\{\cdot, 1\}$ operator is known to lie within the $(0, 1)$ interval). For instance, Case (ix) requires $\min\{f_t(0, e_t)(1 + \alpha_D - \alpha_Y), 1\} = f_t(1, e_t)(1 - \alpha_D - \alpha_Y)$, which is equivalent to requiring $f_t(0, e_t)(1 + \alpha_D - \alpha_Y) = f_t(1, e_t)(1 - \alpha_D - \alpha_Y)$. Furthermore, if both entries of the last two columns explicitly have the $\min\{\cdot, 1\}$ operator, the exact condition for matching the last two columns depends on the actual values of $f_t(0, e_t)$ and $f_t(1, e_t)$.
For instance, Case (xv) requires $\min\{f_t(0, e_t)(1 + \alpha_D + \alpha_Y), 1\} = \min\{f_t(1, e_t)(1 + \alpha_D - \alpha_Y), 1\}$, which is equivalent to one of the following conditions (recall that $1 + \alpha_D \pm \alpha_Y > 1$):

- if we have $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$ and $f_t(1, e_t) \in [\frac{1}{1 + \alpha_D - \alpha_Y}, 1]$, we require $1 = 1$, which trivially holds true;
- if we have $f_t(0, e_t) \in (0, \frac{1}{1 + \alpha_D + \alpha_Y})$ and $f_t(1, e_t) \in [\frac{1}{1 + \alpha_D - \alpha_Y}, 1]$, we require $f_t(0, e_t)(1 + \alpha_D + \alpha_Y) = 1$, which cannot hold true;
- if we have $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$ and $f_t(1, e_t) \in (0, \frac{1}{1 + \alpha_D - \alpha_Y})$, we require $1 = f_t(1, e_t)(1 + \alpha_D - \alpha_Y)$, which cannot hold true;
- if we have $f_t(0, e_t) \in (0, \frac{1}{1 + \alpha_D + \alpha_Y})$ and $f_t(1, e_t) \in (0, \frac{1}{1 + \alpha_D - \alpha_Y})$, we require $\frac{f_t(0, e_t)}{f_t(1, e_t)} = \frac{1 + \alpha_D - \alpha_Y}{1 + \alpha_D + \alpha_Y}$.

Recall that without further assumptions on the data distribution, we do not know which row(s) of the table correspond to a nonzero probability $P\big(g^D_t(0, e_t) = d,\ g^{Y^{(\mathrm{ori})}}_t(0, e_t) = y,\ g^D_t(1, e_t) = d',\ g^{Y^{(\mathrm{ori})}}_t(1, e_t) = y'\big)$. As a result, in general we do not know which set of requirements we should enforce for each $e_t \in \mathcal{E}$. Therefore, we cannot derive a necessary and sufficient condition for attaining $H_{t+1} \perp\!\!\!\perp A_{t+1}$ in general cases.
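The four branches of the Case (xv) matching condition can be checked mechanically (a sketch under our naming, not the authors' code):

```python
def case_xv_equal(f0, f1, alpha_d, alpha_y, tol=1e-12):
    """Check whether min{f0*(1+aD+aY), 1} == min{f1*(1+aD-aY), 1},
    i.e., whether the last two columns of Case (xv) match for the
    given tier values f0 = f_t(0, e_t) and f1 = f_t(1, e_t)."""
    lhs = min(f0 * (1.0 + alpha_d + alpha_y), 1.0)
    rhs = min(f1 * (1.0 + alpha_d - alpha_y), 1.0)
    return abs(lhs - rhs) < tol
```

With $\alpha_D = 0.3$, $\alpha_Y = 0.1$: both sides saturate at $1$ when $f_0, f_1$ are large enough (the trivial branch); when neither saturates, equality requires the ratio $f_0 / f_1 = 1.2 / 1.4$; and a saturated side can never match an unsaturated one.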
Nevertheless, we can summarize the previous analysis and derive a necessary condition for attaining $H_{t+1} \perp\!\!\!\perp A_{t+1}$: only if at least one of the following conditions holds true for all $e_t \in \mathcal{E}$ can we possibly attain $H_{t+1} \perp\!\!\!\perp A_{t+1}$:

(1) The ratio $\frac{f_t(0, e_t)}{f_t(1, e_t)}$ takes a specific value, $\frac{f_t(0, e_t)}{f_t(1, e_t)} = \frac{1 \pm \alpha_D \pm \alpha_Y}{1 \pm \alpha_D \pm \alpha_Y}$, with the signs determined by the case under consideration;

(2) Positive (negative) labels only appear in the advantaged (disadvantaged) group, and the decision for everyone is positive (if $\alpha_D > \alpha_Y$): $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D - \alpha_Y}, 1]$, $f_t(1, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $g^{Y^{(\mathrm{ori})}}_t(0, e_t) = 0$, $g^{Y^{(\mathrm{ori})}}_t(1, e_t) = 1$, $g^D_t(0, e_t) = g^D_t(1, e_t) = 1$;

(3) Negative (positive) labels only appear in the advantaged (disadvantaged) group, and the decision for everyone is positive (if $\alpha_D > \alpha_Y$): $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $f_t(1, e_t) \in [\frac{1}{1 + \alpha_D - \alpha_Y}, 1]$, $g^{Y^{(\mathrm{ori})}}_t(0, e_t) = 1$, $g^{Y^{(\mathrm{ori})}}_t(1, e_t) = 0$, $g^D_t(0, e_t) = g^D_t(1, e_t) = 1$;

(4) Everyone has a positive label, but the positive decision is exclusive to the advantaged group (if $\alpha_D < \alpha_Y$): $f_t(0, e_t) \in [\frac{1}{1 - \alpha_D + \alpha_Y}, 1]$, $f_t(1, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $g^{Y^{(\mathrm{ori})}}_t(0, e_t) = g^{Y^{(\mathrm{ori})}}_t(1, e_t) = 1$, $g^D_t(0, e_t) = 0$, $g^D_t(1, e_t) = 1$;

(5) Everyone has a positive label, but the positive decision is exclusive to the disadvantaged group (if $\alpha_D < \alpha_Y$): $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $f_t(1, e_t) \in [\frac{1}{1 - \alpha_D + \alpha_Y}, 1]$, $g^{Y^{(\mathrm{ori})}}_t(0, e_t) = g^{Y^{(\mathrm{ori})}}_t(1, e_t) = 1$, $g^D_t(0, e_t) = 1$, $g^D_t(1, e_t) = 0$.

Theorem (Theorem 4.2 restated). Let us consider the general situation where both $D_t$ and $Y^{(\mathrm{ori})}_t$ are dependent on $A_t$, i.e., $D_t \not\perp\!\!\!\perp A_t$ and $Y^{(\mathrm{ori})}_t \not\perp\!\!\!\perp A_t$. Under Fact 3.2, Assumption 3.3, Assumption 3.4, and Assumption 3.6, as well as the specified dynamics, when $H_t \not\perp\!\!\!\perp A_t$, the perfect predictor does not have the potential to get closer to the long-term fairness goal after a one-step intervention, i.e., $D_t = Y^{(\mathrm{ori})}_t \implies \Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\big|^{t+1}_t > 0$.

Proof (sketch).
The goal is to determine whether it is possible for the Single-step Tier Imbalance Reduction $\Delta_{\mathrm{STIR}}|^{t+1}_t$ to be smaller than $0$ when using perfect predictors. As defined in Equation 5, $\Delta_{\mathrm{STIR}}|^{t+1}_t$ is a weighted aggregation (integration followed by summation) of $|\varphi(e_{t+1})| - |\varphi(e_t)|$. The quantitative analysis involves three key components: instantiations of $\varphi_{t+1}(e_{t+1})$, the knowledge/assumptions on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$, and characteristics of $P_t(d, d', y, y')$. For the first component, we can list all possible instantiations of $\varphi_{t+1}(e_{t+1})$ in Table 4 (if $\alpha_D > \alpha_Y$), Table 5 (if $\alpha_D < \alpha_Y$), or Table 6 (if $\alpha_D = \alpha_Y$), respectively. For the second component, we can introduce a quantitative assumption on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$ (Assumption 3.6). For the third component, we need to exploit the characteristics of the predictor of interest to gain further insight into the joint distribution $P_t(d, d', y, y')$. For perfect predictors, $P_t(d, d', y, y')$ satisfies Equation 8 (as we have discussed in Section 4.2.1). For the purpose of calculating the value of $\Delta_{\mathrm{STIR}}|^{t+1}_t$, the proof contains two steps: (1) exhaustively derive the value of $|\varphi(e_{t+1})| - |\varphi(e_t)|$ after the one-step dynamics in all possible cases, and (2) aggregate the difference $|\varphi(e_{t+1})| - |\varphi(e_t)|$ with the help of the additional knowledge/assumptions on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$ and $P_t(d, d', y, y')$.
(xi.1.3) $|\varphi(e_{t+1})| - |\varphi(e_t)| = (\alpha_D + \alpha_Y + 2) f_t(0, e_t) + (\alpha_D + \alpha_Y - 2) f_t(1, e_t) > 0$ if we have $f_t(0, e_t) \in (0, \frac{1}{1 + \alpha_D + \alpha_Y})$, $f_t(1, e_t) \in (0, 1]$, and $f_t(0, e_t) \le f_t(1, e_t) < \tan\big(\frac{3\pi}{4} - \arctan\frac{2}{\alpha_D + \alpha_Y}\big)\, f_t(0, e_t)$;

(xi.1.4) $|\varphi(e_{t+1})| - |\varphi(e_t)| = (\alpha_D + \alpha_Y)\big( f_t(0, e_t) + f_t(1, e_t) \big) > 0$ if we have $f_t(0, e_t) \in (0, \frac{1}{1 + \alpha_D + \alpha_Y})$, $f_t(1, e_t) \in (0, 1]$, $f_t(1, e_t) < f_t(0, e_t)$;

(xi.2.1) $|\varphi(e_{t+1})| - |\varphi(e_t)| = 1 + f_t(0, e_t) - (2 - \alpha_D - \alpha_Y) f_t(1, e_t) > 0$ if we have $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $f_t(1, e_t) \in (0, 1]$, $f_t(1, e_t) \ge f_t(0, e_t)$;

(xi.2.2) $|\varphi(e_{t+1})| - |\varphi(e_t)| = 1 - f_t(0, e_t) + (\alpha_D + \alpha_Y) f_t(1, e_t) > 0$ if we have $f_t(0, e_t) \in [\frac{1}{1 + \alpha_D + \alpha_Y}, 1]$, $f_t(1, e_t) \in (0, 1]$, $f_t(1, e_t) < f_t(0, e_t)$.

When $(d, d', y, y') = (0, 0, 0, 0)$, i.e., for Case (i), or $(1, 1, 1, 1)$, i.e., for Case (xvi), $|\varphi(e_{t+1})| - |\varphi(e_t)| = 0$. Aggregating according to Equation 5, for the perfect predictor we have:

$$
\begin{aligned}
\Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\Big|^{t+1}_t
&= P_t(0,1,0,1) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,1,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,1,0,1\big)\, d\xi\, d\epsilon \\
&\quad + P_t(1,0,1,0) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,0,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,0,1,0\big)\, d\xi\, d\epsilon. \tag{C.2}
\end{aligned}
$$

As we can see from Equation C.2, we need to perform two-dimensional integrations on the $\big(f_t(0, e_t), f_t(1, e_t)\big)$ plane, calculating the expectation of the term $|\varphi(e_{t+1})| - |\varphi(e_t)|$ over the conditional densities $q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,1,0,1\big)$ and $q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,0,1,0\big)$. Since these conditional joint densities could be convoluted in general cases, the calculation of the conditional expectations in Equation C.2 could be rather complicated. Therefore, we propose to take advantage of Assumption 3.6 to simplify the calculation quantitatively while remaining consistent with the rather mild qualitative assumption (Assumption 3.5), and derive a result that is numerically clear and informative.
For the purpose of better illustrating the connection between the (qualitative and quantitative) assumptions on $q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid d, d', y, y'\big)$ and the computation of $\Delta_{\mathrm{STIR}}|^{t+1}_t$, we also provide illustrative figures in Figure 7. With the help of Assumption 3.6, we convert the conditional expectations in Equation C.2 into calculations of multiple integrals on slices within a $1 \times 1$ square on the 2-D plane, where the $\phi_0$ and $\phi_1$ axes correspond to the values of $f_t(0, E_t)$ and $f_t(1, E_t)$, respectively:

$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,1,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,1,0,1\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{low})}_{0101} \cdot \Bigg[ \int_0^1 \int_0^{\tan\left(\arctan\frac{1}{\alpha_D+\alpha_Y} - \frac{\pi}{4}\right)\phi_0} -(\alpha_D+\alpha_Y)(\phi_0+\phi_1)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\tan\left(\arctan\frac{1}{\alpha_D+\alpha_Y} - \frac{\pi}{4}\right)\phi_0}^{\phi_0} \big( (\alpha_D+\alpha_Y-2)\phi_0 + (\alpha_D+\alpha_Y+2)\phi_1 \big)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\tan\left(\arctan\frac{1}{\alpha_D+\alpha_Y} - \frac{\pi}{4}\right)\phi_0}^{\frac{1}{1+\alpha_D+\alpha_Y}} \big( (\alpha_D+\alpha_Y-2)\phi_0 + (\alpha_D+\alpha_Y+2)\phi_1 \big)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\phi_0} \big( 1 - (2-\alpha_D-\alpha_Y)\phi_0 + \phi_1 \big)\, d\phi_1\, d\phi_0 \Bigg] \\
&\quad + \gamma^{(\mathrm{up})}_{0101} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\phi_0}^{\frac{1}{1+\alpha_D+\alpha_Y}} (\alpha_D+\alpha_Y)(\phi_0+\phi_1)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \big( 1 + (\alpha_D+\alpha_Y)\phi_0 - \phi_1 \big)\, d\phi_1\, d\phi_0 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\phi_0}^{1} \big( 1 + (\alpha_D+\alpha_Y)\phi_0 - \phi_1 \big)\, d\phi_1\, d\phi_0 \Bigg],
\end{aligned}
$$

$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,0,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,0,1,0\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{up})}_{1010} \cdot \Bigg[ \int_0^1 \int_0^{\tan^{-1}\left(\frac{3\pi}{4} - \arctan\frac{1}{\alpha_D+\alpha_Y}\right)\phi_1} -(\alpha_D+\alpha_Y)(\phi_0+\phi_1)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\tan^{-1}\left(\frac{3\pi}{4} - \arctan\frac{1}{\alpha_D+\alpha_Y}\right)\phi_1}^{\phi_1} \big( (\alpha_D+\alpha_Y+2)\phi_0 + (\alpha_D+\alpha_Y-2)\phi_1 \big)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\tan^{-1}\left(\frac{3\pi}{4} - \arctan\frac{1}{\alpha_D+\alpha_Y}\right)\phi_1}^{\frac{1}{1+\alpha_D+\alpha_Y}} \big( (\alpha_D+\alpha_Y+2)\phi_0 + (\alpha_D+\alpha_Y-2)\phi_1 \big)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\phi_1} \big( 1 + \phi_0 - (2-\alpha_D-\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1 \Bigg] \\
&\quad + \gamma^{(\mathrm{low})}_{1010} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\phi_1}^{\frac{1}{1+\alpha_D+\alpha_Y}} (\alpha_D+\alpha_Y)(\phi_0+\phi_1)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \big( 1 - \phi_0 + (\alpha_D+\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \int_{\phi_1}^{1} \big( 1 - \phi_0 + (\alpha_D+\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1 \Bigg].
\end{aligned}
$$
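Each slice integral is a double integral of a piecewise-linear integrand over a region of the unit square, so it can be sanity-checked numerically. The sketch below (our code, not the authors') evaluates one representative term, the integral of $(\alpha_D+\alpha_Y)(\phi_0+\phi_1)$ over the triangle $\{0 < \phi_0 < c,\ \phi_0 < \phi_1 < c\}$ with $c = \frac{1}{1+\alpha_D+\alpha_Y}$, by a midpoint rule and compares it with its closed form:

```python
def slice_integral_grid(alpha_d, alpha_y, n=600):
    """Midpoint-rule evaluation of the slice integral of
    (alpha_D + alpha_Y)*(phi0 + phi1) over the upper triangle
    {0 < phi0 < c, phi0 < phi1 < c}, c = 1/(1 + alpha_D + alpha_Y)."""
    a = alpha_d + alpha_y
    c = 1.0 / (1.0 + a)
    h = c / n
    total = 0.0
    for i in range(n):
        p0 = (i + 0.5) * h
        for j in range(n):
            p1 = (j + 0.5) * h
            if p1 > p0:  # restrict to the upper triangle of the c-by-c square
                total += a * (p0 + p1) * h * h
    return total

def slice_integral_exact(alpha_d, alpha_y):
    # Closed form: by symmetry across the diagonal, the triangle carries
    # half of the square's mass c^3, i.e., (alpha_D + alpha_Y) * c^3 / 2.
    a = alpha_d + alpha_y
    c = 1.0 / (1.0 + a)
    return a * c ** 3 / 2.0
```

The remaining slice terms can be verified the same way; only the region boundaries and the linear integrands change.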
Since $\gamma^{(\mathrm{low})}_{0101} + \gamma^{(\mathrm{up})}_{0101} = 2$ and $\gamma^{(\mathrm{low})}_{1010} + \gamma^{(\mathrm{up})}_{1010} = 2$, we can derive the form of $\Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}|^{t+1}_t$:

$$
\begin{aligned}
\Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\Big|^{t+1}_t
&= \Big( P_t(0,1,0,1)\,\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\,\gamma^{(\mathrm{up})}_{1010} \Big) \cdot \Bigg[ -\frac{1 + \alpha_D + \alpha_Y}{3} \tan^2\Big( \arctan\frac{1}{\alpha_D + \alpha_Y} - \frac{\pi}{4} \Big) \\
&\qquad + \frac{2(1 - \alpha_D - \alpha_Y)}{3} \tan\Big( \arctan\frac{1}{\alpha_D + \alpha_Y} - \frac{\pi}{4} \Big) - \frac{1 - \alpha_D - \alpha_Y}{6} + \frac{3 - \alpha_D - \alpha_Y}{2(1 + \alpha_D + \alpha_Y)} \\
&\qquad + \frac{3(\alpha_D + \alpha_Y)^3 - 6(\alpha_D + \alpha_Y)^2 - 19(\alpha_D + \alpha_Y) - 10}{6(1 + \alpha_D + \alpha_Y)^3} \Bigg] \\
&\quad + \big( P_t(0,1,0,1) + P_t(1,0,1,0) \big) \cdot \Bigg[ \frac{\alpha_D + \alpha_Y - 2}{3} + \frac{\alpha_D + \alpha_Y}{1 + \alpha_D + \alpha_Y} + \frac{3(\alpha_D + \alpha_Y)^2 + 5(\alpha_D + \alpha_Y) + 2}{3(1 + \alpha_D + \alpha_Y)^3} \Bigg],
\end{aligned}
$$

where $\gamma^{(\mathrm{low})}_{0101}, \gamma^{(\mathrm{up})}_{1010} \in (0, 1)$ (according to Assumption 3.6), and $\alpha_D, \alpha_Y \in [0, \frac{1}{2})$ (according to Assumption 3.3).

Let us denote

$$\beta(\alpha_D, \alpha_Y) := \tan\Big( \arctan\frac{1}{\alpha_D + \alpha_Y} - \frac{\pi}{4} \Big)$$

to simplify the notation. Without loss of generality, let us assume $P_t(0,1,0,1)\,\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\,\gamma^{(\mathrm{up})}_{1010} > 0$. We can further compute the partial derivatives and find that

$$
\begin{aligned}
\frac{\partial\, \Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\big|^{t+1}_t}{\partial \big( P_t(0,1,0,1)\,\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\,\gamma^{(\mathrm{up})}_{1010} \big)}
&= -\frac{1 + \alpha_D + \alpha_Y}{3}\, \beta^2(\alpha_D, \alpha_Y) + \frac{2(1 - \alpha_D - \alpha_Y)}{3}\, \beta(\alpha_D, \alpha_Y) - \frac{1 - \alpha_D - \alpha_Y}{6} \\
&\quad + \frac{3 - \alpha_D - \alpha_Y}{2(1 + \alpha_D + \alpha_Y)} + \frac{3(\alpha_D + \alpha_Y)^3 - 6(\alpha_D + \alpha_Y)^2 - 19(\alpha_D + \alpha_Y) - 10}{6(1 + \alpha_D + \alpha_Y)^3} \\
&< 0, \qquad \forall\, \alpha_D, \alpha_Y \in [0, \tfrac{1}{2}),
\end{aligned}
$$

and that

$$
\begin{aligned}
\frac{\partial\, \Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\big|^{t+1}_t}{\partial (\alpha_D + \alpha_Y)}
&= \big( P_t(0,1,0,1) + P_t(1,0,1,0) \big) \Big( \frac{1}{3} + \frac{2}{3(1 + \alpha_D + \alpha_Y)^3} \Big) \\
&\quad + \Big( P_t(0,1,0,1)\,\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\,\gamma^{(\mathrm{up})}_{1010} \Big) \cdot \Bigg( -\frac{2(1 + \alpha_D + \alpha_Y)}{3}\, \beta(\alpha_D, \alpha_Y)\, \frac{\partial \beta(\alpha_D, \alpha_Y)}{\partial (\alpha_D + \alpha_Y)} \\
&\qquad + \frac{2(1 - \alpha_D - \alpha_Y)}{3}\, \frac{\partial \beta(\alpha_D, \alpha_Y)}{\partial (\alpha_D + \alpha_Y)} + \frac{1}{6} - \frac{2}{(1 + \alpha_D + \alpha_Y)^2} + \frac{15(\alpha_D + \alpha_Y) + 11}{6(1 + \alpha_D + \alpha_Y)^3} \Bigg) \\
&= \big( P_t(0,1,0,1) + P_t(1,0,1,0) \big) \Big( \frac{1}{3} + \frac{2}{3(1 + \alpha_D + \alpha_Y)^3} \Big) \\
&\quad + \Big( P_t(0,1,0,1)\,\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\,\gamma^{(\mathrm{up})}_{1010} \Big) \cdot \Bigg( \frac{2\big(1 + \beta^2(\alpha_D, \alpha_Y)\big)\big( (1 + \alpha_D + \alpha_Y)\,\beta(\alpha_D, \alpha_Y) + \alpha_D + \alpha_Y - 1 \big)}{3\big(1 + (\alpha_D + \alpha_Y)^2\big)} \\
&\qquad + \frac{(\alpha_D + \alpha_Y)^3 + 3(\alpha_D + \alpha_Y)^2 + 6(\alpha_D + \alpha_Y)}{6(1 + \alpha_D + \alpha_Y)^3} \Bigg) \\
&> 0, \qquad \forall\, \gamma^{(\mathrm{low})}_{0101}, \gamma^{(\mathrm{up})}_{1010} \in (0, 1),\ \alpha_D, \alpha_Y \in [0, \tfrac{1}{2}),
\end{aligned}
$$

where we utilize the fact that $\frac{\partial \beta(\alpha_D, \alpha_Y)}{\partial (\alpha_D + \alpha_Y)} = -\big(1 + \beta^2(\alpha_D, \alpha_Y)\big) \cdot \frac{1}{1 + (\alpha_D + \alpha_Y)^2}$. Therefore, we can conclude that $\Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\big|^{t+1}_t$ is strictly larger than its limit as $\gamma^{(\mathrm{low})}_{0101} \to 1$, $\gamma^{(\mathrm{up})}_{1010} \to 1$, and $\alpha_D + \alpha_Y \to 0$, and this limit evaluates to $0$, i.e., under the specified
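As a numerical sanity check of the sign analysis (a sketch; `coeff_bracket` is our transcription of the bracketed coefficient that multiplies $P_t(0,1,0,1)\gamma^{(\mathrm{low})}_{0101} + P_t(1,0,1,0)\gamma^{(\mathrm{up})}_{1010}$, written as a function of $a := \alpha_D + \alpha_Y$):

```python
import math

def coeff_bracket(a: float) -> float:
    """Transcription of the bracketed coefficient in the closed form of
    Delta_STIR for perfect predictors, with a := alpha_D + alpha_Y > 0.
    The claim in the proof is that it is negative, approaching 0 as a -> 0."""
    beta = math.tan(math.atan(1.0 / a) - math.pi / 4.0)
    return (-(1.0 + a) / 3.0 * beta ** 2
            + 2.0 * (1.0 - a) / 3.0 * beta
            - (1.0 - a) / 6.0
            + (3.0 - a) / (2.0 * (1.0 + a))
            + (3.0 * a ** 3 - 6.0 * a ** 2 - 19.0 * a - 10.0)
              / (6.0 * (1.0 + a) ** 3))
```

Evaluating on a grid of $a \in (0, 1)$ (the range implied by $\alpha_D, \alpha_Y \in [0, \frac{1}{2})$) gives strictly negative values, vanishing as $a \to 0$, consistent with the limit argument in the proof.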
assumptions and dynamics, we have

$$D_t = Y^{(\mathrm{ori})}_t \implies \Delta^{(\mathrm{Perfect\ Predictor})}_{\mathrm{STIR}}\Big|^{t+1}_t > 0, \qquad \forall\, \gamma^{(\mathrm{low})}_{0101}, \gamma^{(\mathrm{up})}_{1010} \in (0, 1),\ \alpha_D, \alpha_Y \in [0, \tfrac{1}{2}).$$

Theorem (Theorem 4.3 restated). Let us consider the general situation where both $D_t$ and $Y^{(\mathrm{ori})}_t$ are dependent on $A_t$, i.e., $D_t \not\perp\!\!\!\perp A_t$ and $Y^{(\mathrm{ori})}_t \not\perp\!\!\!\perp A_t$. Let us further assume that the data dynamics satisfies $\alpha_D \in (0, \frac{1}{2})$, $\alpha_Y = 0$. Then under Fact 3.2, Assumption 3.3, Assumption 3.4, and Assumption 3.6, as well as the specified dynamics, when $H_t \not\perp\!\!\!\perp A_t$, it is possible for the Counterfactual Fair predictor to get closer to the long-term fairness goal after a one-step intervention, if certain properties of the data dynamics and the predictor behavior are satisfied simultaneously, i.e.,

$$
\begin{cases}
g^D_t(0, E_t) = g^D_t(1, E_t) \\[2pt]
\frac{P_t(1,1,0,1) + P_t(1,1,1,0)}{P_t(0,0,0,1) + P_t(0,0,1,0)} < \frac{27}{8} \\[4pt]
\alpha_D \in \Big( \big( \frac{P_t(1,1,0,1) + P_t(1,1,1,0)}{P_t(0,0,0,1) + P_t(0,0,1,0)} \big)^{1/3} - 1,\ \frac{1}{2} \Big) \\[4pt]
\alpha_Y = 0
\end{cases}
\implies \Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}\Big|^{t+1}_t < 0.
$$

Proof (sketch). Similar to proving Theorem 4.2 (proof in Appendix C.3), the goal is to determine whether the Single-step Tier Imbalance Reduction $\Delta_{\mathrm{STIR}}|^{t+1}_t$ can be smaller than $0$ when using Counterfactual Fair predictors. Since $\Delta_{\mathrm{STIR}}|^{t+1}_t$ is a weighted aggregation of $|\varphi(e_{t+1})| - |\varphi(e_t)|$ (as defined in Equation 5), the quantitative analysis involves three key components: instantiations of $\varphi_{t+1}(e_{t+1})$, the knowledge/assumptions on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$, and characteristics of $P_t(d, d', y, y')$. For the first component, since $\alpha_Y = 0$ is a special case of scenarios where $\alpha_D > \alpha_Y$, we can list all possible instantiations of $\varphi_{t+1}(e_{t+1})$ in Table 4 (when $\alpha_D > \alpha_Y$). For the second component, we can introduce a quantitative assumption on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$ (Assumption 3.6). For the third component, we need to exploit the characteristics of the predictor of interest to gain further insight into the joint distribution $P_t(d, d', y, y')$. For Counterfactual Fair predictors, $P_t(d, d', y, y')$ satisfies Equation 10 (as we have discussed in Section 4.2.2).
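The $\frac{27}{8}$ threshold in the condition above can be understood through the interval for $\alpha_D$: the interval $\big( r^{1/3} - 1,\ \frac{1}{2} \big)$ with $r := \frac{P_t(1,1,0,1)+P_t(1,1,1,0)}{P_t(0,0,0,1)+P_t(0,0,1,0)}$ is nonempty exactly when $r^{1/3} - 1 < \frac{1}{2}$, i.e., $r < (\frac{3}{2})^3 = \frac{27}{8}$. A small sketch (our helper name, not the authors' code):

```python
def cf_alpha_interval(ratio: float):
    """Interval of admissible alpha_D in the sufficient condition of
    Theorem 4.3: (ratio**(1/3) - 1, 1/2), where ratio is the probability
    ratio r defined in the theorem. Returns None when the interval is
    empty, which happens exactly when ratio >= 27/8 = (3/2)**3."""
    lo = ratio ** (1.0 / 3.0) - 1.0
    hi = 0.5
    return (lo, hi) if lo < hi else None
```

So the two displayed requirements on $r$ and $\alpha_D$ are consistent with each other: the bound $r < \frac{27}{8}$ is precisely what makes a valid $\alpha_D$ exist.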
For the purpose of calculating the value of $\Delta_{\mathrm{STIR}}|^{t+1}_t$, the proof contains two steps: (1) exhaustively derive the value of $|\varphi(e_{t+1})| - |\varphi(e_t)|$ after the one-step dynamics (already carried out in Appendix C.3 when proving Theorem 4.2), and (2) aggregate the difference $|\varphi(e_{t+1})| - |\varphi(e_t)|$ with the help of the additional knowledge/assumptions on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$ and $P_t(d, d', y, y')$.

Proof (full). Based on the definition of $\Delta_{\mathrm{STIR}}|^{t+1}_t$, the proof calculates the aggregation (integration followed by summation) of the difference $|\varphi(e_{t+1})| - |\varphi(e_t)|$ with the help of the additional knowledge/assumptions on $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$ and $P_t(d, d', y, y')$. Since we assume $\alpha_D \in (0, \frac{1}{2})$, $\alpha_Y = 0$, we focus on the possible instantiations of $\varphi_{t+1}(e_{t+1})$ as listed in Table 4 ($\alpha_D > \alpha_Y$). For the Counterfactual Fair predictor, which satisfies $g^D_t(0, E_t) = g^D_t(1, E_t)$, not every case in Table 4 corresponds to a nonzero $P_t(d, d', y, y')$ and therefore not every case contributes to the computation of $\Delta_{\mathrm{STIR}}|^{t+1}_t$ as detailed in Equation 5. By applying Equation 10, we need to consider Case (i), Case (ii), Case (iii), Case (iv), Case (xiii), Case (xiv), Case (xv), and Case (xvi) in Table 4:

$$
\begin{aligned}
\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}\Big|^{t+1}_t
&= P_t(0,0,0,1) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,0,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,0,0,1\big)\, d\xi\, d\epsilon \\
&\quad + P_t(0,0,1,0) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,0,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,0,1,0\big)\, d\xi\, d\epsilon \\
&\quad + P_t(1,1,0,1) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,1,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,1,0,1\big)\, d\xi\, d\epsilon \\
&\quad + P_t(1,1,1,0) \int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,1,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,1,1,0\big)\, d\xi\, d\epsilon. \tag{C.4}
\end{aligned}
$$
Similar to the proof of the result for perfect predictors presented in Appendix C.3, with the help of Assumption 3.6 we convert the conditional expectations in Equation C.4 into calculations of multiple integrals on slices within a $1 \times 1$ square on the 2-D plane, where the $\phi_0$ and $\phi_1$ axes correspond to the values of $f_t(0, E_t)$ and $f_t(1, E_t)$, respectively:

$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,0,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,0,1,0\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{up})}_{0010} \cdot \Bigg[ \int_0^1 \int_0^{\frac{1-\alpha_D-\alpha_Y}{1-\alpha_D+\alpha_Y}\phi_1} \big( (\alpha_D-\alpha_Y)\phi_0 - (\alpha_D+\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1 + \int_0^1 \int_{\frac{1-\alpha_D-\alpha_Y}{1-\alpha_D+\alpha_Y}\phi_1}^{\phi_1} \big( (2-\alpha_D+\alpha_Y)\phi_0 - (2-\alpha_D-\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1 \Bigg] \\
&\quad + \gamma^{(\mathrm{low})}_{0010} \cdot \int_0^1 \int_0^{\phi_0} \big( (\alpha_D+\alpha_Y)\phi_1 - (\alpha_D-\alpha_Y)\phi_0 \big)\, d\phi_1\, d\phi_0,
\end{aligned}
$$

$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,1,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,1,0,1\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{low})}_{1101} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\phi_1}^{\frac{1+\alpha_D+\alpha_Y}{1+\alpha_D-\alpha_Y}\phi_1} \big( (2+\alpha_D+\alpha_Y)\phi_1 - (2+\alpha_D-\alpha_Y)\phi_0 \big)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_0^{\frac{1+\alpha_D-\alpha_Y}{1+\alpha_D+\alpha_Y}\phi_0} \big( (\alpha_D-\alpha_Y)\phi_0 - (\alpha_D+\alpha_Y)\phi_1 \big)\, d\phi_1\, d\phi_0 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\phi_0} \big( 1 - (2+\alpha_D-\alpha_Y)\phi_0 + \phi_1 \big)\, d\phi_1\, d\phi_0 \Bigg] \\
&\quad + \gamma^{(\mathrm{up})}_{1101} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_0^{\phi_1} \big( (\alpha_D+\alpha_Y)\phi_1 - (\alpha_D-\alpha_Y)\phi_0 \big)\, d\phi_0\, d\phi_1 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \big( 1 - (\alpha_D-\alpha_Y)\phi_0 - \phi_1 \big)\, d\phi_1\, d\phi_0 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_{\phi_0}^{1} \big( 1 - (\alpha_D-\alpha_Y)\phi_0 - \phi_1 \big)\, d\phi_1\, d\phi_0 \Bigg].
\end{aligned}
$$

The remaining two terms (those associated with $q_t(\cdot \mid 0,0,0,1)$ and $q_t(\cdot \mid 1,1,1,0)$) are computed analogously. We can then derive the form of $\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}|^{t+1}_t$:

$$
\begin{aligned}
\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}\Big|^{t+1}_t
&= \Big( P_t(0,0,0,1)\,\gamma^{(\mathrm{low})}_{0001} + P_t(0,0,1,0)\,\gamma^{(\mathrm{up})}_{0010} \Big) \cdot \Bigg[ \frac{(\alpha_D - \alpha_Y)(1 - \alpha_D - \alpha_Y)^2}{6(1 - \alpha_D + \alpha_Y)^2} - \frac{(\alpha_D + \alpha_Y)(1 - \alpha_D - \alpha_Y)}{3(1 - \alpha_D + \alpha_Y)} \\
&\qquad + \frac{2\alpha_Y}{3(1 - \alpha_D + \alpha_Y)} \Big( -(2 - \alpha_D - \alpha_Y) + \frac{(2 - \alpha_D + \alpha_Y)(1 - \alpha_D)}{1 - \alpha_D + \alpha_Y} \Big) + \frac{\alpha_D}{6} - \frac{\alpha_Y}{2} \Bigg] \\
&\quad + P_t(1,1,0,1)\,\gamma^{(\mathrm{low})}_{1101} \cdot \Bigg[ \frac{2\alpha_Y}{3(1 + \alpha_D - \alpha_Y)(1 + \alpha_D + \alpha_Y)^3} \Big( (2 + \alpha_D + \alpha_Y) - \frac{(2 + \alpha_D - \alpha_Y)(1 + \alpha_D)}{1 + \alpha_D - \alpha_Y} \Big) \\
&\qquad + \frac{1}{3(1 + \alpha_D + \alpha_Y)(1 + \alpha_D - \alpha_Y)^3} \Big( (\alpha_D - \alpha_Y)(1 + \alpha_D - \alpha_Y) - \frac{(\alpha_D + \alpha_Y)(1 + \alpha_D - \alpha_Y)^2}{2(1 + \alpha_D + \alpha_Y)} \Big) \\
&\qquad - \frac{1}{(1 + \alpha_D + \alpha_Y)^3}\Big( \frac{\alpha_D}{6} + \frac{\alpha_Y}{2} \Big) - \frac{2}{3}(1 + \alpha_D - \alpha_Y)\Big( \frac{1}{(1 + \alpha_D - \alpha_Y)^3} - \frac{1}{(1 + \alpha_D + \alpha_Y)^3} \Big) + \frac{3 + 2\alpha_D}{2(1 + \alpha_D + \alpha_Y)} \\
&\qquad + \frac{1 + \alpha_D - \alpha_Y}{2}\Big( \frac{1}{(1 + \alpha_D - \alpha_Y)^2} - \frac{1}{(1 + \alpha_D + \alpha_Y)^2} \Big) - \frac{3 + 2\alpha_D + 2\alpha_Y}{2(1 + \alpha_D + \alpha_Y)^2} \\
&\qquad + \frac{1}{2}\Big( \frac{1}{1 + \alpha_D - \alpha_Y} - \frac{1}{1 + \alpha_D + \alpha_Y} \Big) - \frac{(\alpha_D + \alpha_Y)\alpha_Y}{(1 + \alpha_D + \alpha_Y)^3} \Bigg] \\
&\quad + P_t(1,1,1,0)\,\gamma^{(\mathrm{up})}_{1110} \cdot \Bigg[ \frac{2\alpha_Y}{3(1 + \alpha_D - \alpha_Y)(1 + \alpha_D + \alpha_Y)^3} \Big( (2 + \alpha_D + \alpha_Y) - \frac{(2 + \alpha_D - \alpha_Y)(1 + \alpha_D)}{1 + \alpha_D - \alpha_Y} \Big) \\
&\qquad + \frac{1}{3(1 + \alpha_D + \alpha_Y)(1 + \alpha_D - \alpha_Y)^3} \Big( (\alpha_D - \alpha_Y)(1 + \alpha_D - \alpha_Y) - \frac{(\alpha_D + \alpha_Y)(1 + \alpha_D - \alpha_Y)^2}{2(1 + \alpha_D + \alpha_Y)} \Big) \\
&\qquad + \frac{\alpha_D - \alpha_Y}{(1 + \alpha_D + \alpha_Y)(1 + \alpha_D - \alpha_Y)} - \frac{(\alpha_D + \alpha_Y)(\alpha_D - \alpha_Y)}{2(1 + \alpha_D + \alpha_Y)^2 (1 + \alpha_D - \alpha_Y)} \\
&\qquad - \frac{1}{2(1 + \alpha_D + \alpha_Y)}\Big( 1 - \frac{1}{(1 + \alpha_D - \alpha_Y)^2} \Big) - \Big( \frac{\alpha_D}{6} + \frac{\alpha_Y}{2} \Big)\frac{1}{(1 + \alpha_D + \alpha_Y)^3} \Bigg] \\
&\quad + \big( P_t(0,0,0,1) + P_t(0,0,1,0) \big)\Big( -\frac{\alpha_D}{3} + \alpha_Y \Big) \\
&\quad + P_t(1,1,0,1) \cdot \Bigg[ \frac{1}{(1 + \alpha_D + \alpha_Y)^3}\Big( \frac{\alpha_D}{3} + \alpha_Y \Big) + \frac{2(\alpha_D + \alpha_Y)\alpha_Y}{(1 + \alpha_D + \alpha_Y)^3} + \frac{1}{1 + \alpha_D - \alpha_Y} - \frac{1}{1 + \alpha_D + \alpha_Y} \\
&\qquad + \Big( \frac{1}{3} + \frac{2\alpha_D}{3} - \frac{2\alpha_Y}{3} \Big)\Big( \frac{1}{(1 + \alpha_D - \alpha_Y)^3} - \frac{1}{(1 + \alpha_D + \alpha_Y)^3} \Big) - (1 + \alpha_D - \alpha_Y)\Big( \frac{1}{(1 + \alpha_D - \alpha_Y)^2} - \frac{1}{(1 + \alpha_D + \alpha_Y)^2} \Big) \Bigg] \\
&\quad + P_t(1,1,1,0)\Big( \frac{\alpha_D}{3} + \alpha_Y \Big)\frac{1}{(1 + \alpha_D + \alpha_Y)^3},
\end{aligned}
$$



Due to the space limit, we provide detailed discussions on related works in Appendix A, and additional discussions on the decision-distribution interplay in Appendix B.1. In Appendix B.2, we analyze the scenario in which Tier Balancing is initially satisfied. The details of the derivation can be found in Appendix B.4. In Appendix B.5, we present illustrative figures to demonstrate the connection between the qualitative and quantitative assumptions. Our code repository is available on GitHub: https://github.com/zeyutang/TierBalancing.



Figure 2: Illustration of the interplay between decisions made with perfect predictors and the data dynamics (20 steps) on simulated data, with different initializations of the tier $H_t$.

Figure 3: Illustration of the interplay between decisions made with Counterfactual Fair predictors and the data dynamics (20 steps) on the credit score data set. Panels (a) and (b) present the step-by-step tracks of updates in tier, accuracy, and approval rates for different groups; panel (c) presents group-conditioned distributions of the tier before (left) and after (right) 20 steps of interventions. The legend is shared across panels (a), (b), and (c).

A Detailed Discussions on Related Works
A.1 Causal Notions of Fairness
A.2 Fairness Inquiries in Dynamic Settings

B Additional Results, Technical Details, and Discussions
B.1 Discussions on the Causal Modeling of Decision-Distribution Interplay
B.2 When Tier Balancing is Initially Satisfied
B.3 A Remark on Fact 3.2
B.4 Detailed Derivation of $\Delta_{\mathrm{STIR}}|^{t+1}_t$ (Section 3.2.1)
B.5 Further Illustration on Assumption 3.5 and Assumption 3.6
B.6 Additional Experimental Results
B.7 Potential Limitations of Our Work

C Proof of Results
C.1 Proof for Proposition 3.1
C.2 Proof for Theorem 4.1
C.3 Proof for Theorem 4.2
C.4 Proof for Theorem 4.3

LIST OF TABLES
1 When $\alpha_D > \alpha_Y$, compare cases of $H_{t+1}$ with different values of $A_t$.
2 When $\alpha_D < \alpha_Y$, compare cases of $H_{t+1}$ with different values of $A_t$.
3 When $\alpha_D = \alpha_Y = \alpha$, compare cases of $H_{t+1}$ with different values of $A_t$.
4 When $\alpha_D > \alpha_Y$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.
5 When $\alpha_D < \alpha_Y$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.
6 When $\alpha_D = \alpha_Y = \alpha$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.

(2022). They consider a detailed causal modeling of the college admission running example and analyze previously proposed (instantaneous) causal fairness notions, namely, Counterfactual Predictive Parity (Coston et al., 2020), Counterfactual Equalized Odds (Coston et al., 2020), and Conditional Principal Fairness (Imai & Jiang, 2020). Our work is different in several ways: instead of utilizing a static graph, we focus on the decision-distribution interplay and explicitly capture both observed and latent variables along the temporal axis; different from analyzing the one-step downstream consequence in terms of utility, we formulate a long-term fairness goal and investigate the challenges and opportunities revealed by the notion.

B ADDITIONAL RESULTS, TECHNICAL DETAILS, AND DISCUSSIONS

In this section, we provide additional results, technical details, and discussions of our work. In Appendix B.1, we provide additional discussions on our causal modeling of the decision-distribution interplay; in Appendix B.2, we analyze the situation where Tier Balancing is initially attained; in Appendix B.3, we discuss the role of exogenous terms and provide a remark on Fact 3.2; in Appendix B.4, we present the detailed derivation of the Single-step Tier Imbalance Reduction (STIR) term $\Delta_{\mathrm{STIR}}|^{t+1}_t$; in Appendix B.5, we illustrate the connection between Assumption 3.5 and Assumption 3.6; in Appendix B.6, we present additional experimental results; in Appendix B.7, we discuss potential limitations of our work.

B.1 DISCUSSIONS ON THE CAUSAL MODELING OF DECISION-DISTRIBUTION INTERPLAY

In Appendix B.1.1, we provide additional details of the dynamics involved in the causal modeling of the decision-distribution interplay. In Appendix B.1.2, we discuss the relation between practical scenarios and the modeled dynamics.

Figure 4: The causal modeling of the decision-distribution interplay. A circle indicates that the corresponding variable is unobserved. We use a diamond to denote the underlying causal factor and explicitly indicate the (potentially) non-stationary nature of the decision-making strategies across time.

Note that we distinguish $e^{(j)}_{t+1}$ in $\varphi_{t+1}(e^{(j)}_{t+1})$ from $e^{(j)}_t$ in $\varphi_t(e^{(j)}_t)$, since it is not necessarily the case that $e^{(j)}_{t+1} = e^{(j)}_t$.

Figure 5: An illustration of the connection between the qualitative and quantitative assumptions in terms of the one-step update of the conditional joint distribution $q_T\big(f_T(0, e_T), f_T(1, e_T) \mid d, d', y, y'\big)$ (when $y < y'$, and from $T = t$ to $T = t + 1$).

Figure 6: Illustration of the interplay between decisions made with accuracy-oriented predictors and the data dynamics (20 steps) on the credit score data set. Panels (a) and (b) present the step-by-step tracks of updates in tier, accuracy, and approval rates for different groups; panel (c) presents group-conditioned distributions of the tier before (left) and after (right) 20 steps of interventions. The legend is shared across panels (a), (b), and (c).

Note that we cannot directly impose assumptions on $q_{t+1}\big(f_{t+1}(0, \epsilon), f_{t+1}(1, \epsilon) \mid d, d', y, y'\big)$ and $P_{t+1}(d, d', y, y')$, since they involve future information $D_{t+1}$ and $Y_{t+1}$ from the standpoint of time step $T = t$.

B.5 FURTHER ILLUSTRATION ON ASSUMPTION 3.5 AND ASSUMPTION 3.6

Figure 7: Illustration of the sliced squares on the $\big(f_t(0, e_t), f_t(1, e_t)\big)$ plane. Depending on the initial situation, i.e., the slice that the pair $\big(f_t(0, e_t), f_t(1, e_t)\big)$ falls upon, the term $|\varphi(e_{t+1})| - |\varphi(e_t)|$ takes different values. The shaded slices indicate that if the initial situation satisfies the corresponding condition, the calculated $|\varphi(e_{t+1})| - |\varphi(e_t)| < 0$.

When $(d, d', y, y') = (0, 0, 0, 0)$, i.e., for Case (i), or $(d, d', y, y') = (1, 1, 1, 1)$, i.e., for Case (xvi), $|\varphi(e_{t+1})| - |\varphi(e_t)| = 0$. Now we proceed to the second step and aggregate the $|\varphi(e_{t+1})| - |\varphi(e_t)|$ terms. According to Equation 5 and Equation B.5, for the perfect predictor we have Equation C.2.


When $(d, d', y, y')$ satisfies $y = y'$, i.e., for Case (i), Case (iv), Case (xiii), and Case (xvi), we have $|\varphi(e_{t+1})| - |\varphi(e_t)| = 0$. Therefore, we only need to calculate $\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}|^{t+1}_t$ for Case (ii), Case (iii), Case (xiv), and Case (xv) (although $\alpha_Y = 0$, we explicitly keep the hyperparameter $\alpha_Y$ in the proof for the purpose of notational consistency). According to Equation 5 and Equation B.5, for the Counterfactual Fair predictor we have the aggregation given in Equation C.4.


$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 0,0,0,1, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 0,0,0,1\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{low})}_{0001} \cdot \Bigg[ \int_0^1 \int_0^{\frac{1-\alpha_D-\alpha_Y}{1-\alpha_D+\alpha_Y}\phi_0} \big( (\alpha_D-\alpha_Y)\phi_1 - (\alpha_D+\alpha_Y)\phi_0 \big)\, d\phi_1\, d\phi_0 + \int_0^1 \int_{\frac{1-\alpha_D-\alpha_Y}{1-\alpha_D+\alpha_Y}\phi_0}^{\phi_0} \big( (2-\alpha_D+\alpha_Y)\phi_1 - (2-\alpha_D-\alpha_Y)\phi_0 \big)\, d\phi_1\, d\phi_0 \Bigg] \\
&\quad + \gamma^{(\mathrm{up})}_{0001} \cdot \int_0^1 \int_0^{\phi_1} \big( (\alpha_D+\alpha_Y)\phi_0 - (\alpha_D-\alpha_Y)\phi_1 \big)\, d\phi_0\, d\phi_1.
\end{aligned}
$$

Analogously, for the term associated with $q_t(\cdot \mid 1,1,1,0)$ we have:

$$
\begin{aligned}
&\int_{\epsilon \in \mathcal{E}} \int_{\xi \in \mathcal{E}} \big( |\varphi_{t+1}(\xi)| - |\varphi_t(\epsilon)| \big)\, \mathbb{1}\{\varphi_{t+1}(\xi) = G(f_t, g^D_t, g^{Y^{(\mathrm{ori})}}_t; 1,1,1,0, \epsilon, \alpha_D, \alpha_Y)\}\, q_t\big(f_t(0,\epsilon), f_t(1,\epsilon) \mid 1,1,1,0\big)\, d\xi\, d\epsilon \\
&= \gamma^{(\mathrm{up})}_{1110} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\phi_0}^{\frac{1+\alpha_D+\alpha_Y}{1+\alpha_D-\alpha_Y}\phi_0} \big( (2+\alpha_D+\alpha_Y)\phi_0 - (2+\alpha_D-\alpha_Y)\phi_1 \big)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_0^{\frac{1+\alpha_D-\alpha_Y}{1+\alpha_D+\alpha_Y}\phi_1} \big( (\alpha_D-\alpha_Y)\phi_1 - (\alpha_D+\alpha_Y)\phi_0 \big)\, d\phi_0\, d\phi_1 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\phi_1} \big( 1 - (2+\alpha_D-\alpha_Y)\phi_1 + \phi_0 \big)\, d\phi_0\, d\phi_1 \Bigg] \\
&\quad + \gamma^{(\mathrm{low})}_{1110} \cdot \Bigg[ \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_0^{\phi_0} \big( (\alpha_D+\alpha_Y)\phi_0 - (\alpha_D-\alpha_Y)\phi_1 \big)\, d\phi_1\, d\phi_0 \\
&\qquad + \int_0^{\frac{1}{1+\alpha_D+\alpha_Y}} \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{1} \big( 1 - (\alpha_D-\alpha_Y)\phi_1 - \phi_0 \big)\, d\phi_0\, d\phi_1 + \int_{\frac{1}{1+\alpha_D+\alpha_Y}}^{\frac{1}{1+\alpha_D-\alpha_Y}} \int_{\phi_1}^{1} \big( 1 - (\alpha_D-\alpha_Y)\phi_1 - \phi_0 \big)\, d\phi_0\, d\phi_1 \Bigg].
\end{aligned}
$$

Since $\gamma^{(\mathrm{low})}_{0001} + \gamma^{(\mathrm{up})}_{0001} = 2$, $\gamma^{(\mathrm{low})}_{0010} + \gamma^{(\mathrm{up})}_{0010} = 2$, $\gamma^{(\mathrm{low})}_{1101} + \gamma^{(\mathrm{up})}_{1101} = 2$, and $\gamma^{(\mathrm{low})}_{1110} + \gamma^{(\mathrm{up})}_{1110} = 2$, we can derive the form of the term $\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}|^{t+1}_t$.

Proposition B.1. When Tier Balancing is initially satisfied, i.e., $H_t \perp\!\!\!\perp A_t$, under Fact 3.2, Assumption 3.3, and Assumption 3.4, as well as the specified dynamics, the Demographic Parity decision-making strategy, i.e., $D_t \perp\!\!\!\perp A_t$, can ensure that Tier Balancing still holds true for the next time step, i.e., $H_{t+1} \perp\!\!\!\perp A_{t+1}$.

Proof. To begin with, since $H_t \perp\!\!\!\perp A_t$, by Fact 3.2, $H_t$ is not a function of $A_t$. As a direct result, $Y^{(\mathrm{ori})}_t$ is also not a function of $A_t$ (since the distribution of $Y^{(\mathrm{ori})}_t$ is fully determined by the value of $H_t = h_t$). Besides, since $D_t$ satisfies Demographic Parity, $D_t \perp\!\!\!\perp A_t$, and therefore by Fact 3.2, $D_t$ is not a function of $A_t$. According to Assumption 3.3, under the specified dynamics, $H_{t+1}$ is fully determined by $(H_t, D_t, Y^{(\mathrm{ori})}_t)$.
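Proposition B.1 can be illustrated with a small simulation (our sketch under the stated dynamics, not the authors' code; the sampling scheme and parameter values are our own choices): when the tier is initially independent of the protected attribute and the decision is randomized independently of it (a simple way to satisfy Demographic Parity), the updated tier remains balanced across groups.

```python
import random

def simulate_one_step(n=20000, alpha_d=0.2, alpha_y=0.1,
                      p_decision=0.5, seed=0):
    """Simulate one step of the tier dynamics under a DP decision.

    H_t ~ Uniform(0.1, 1), independent of A; Y_t^(ori) ~ Bernoulli(H_t);
    D_t ~ Bernoulli(p_decision), independent of A (Demographic Parity).
    Returns the group-wise means of H_{t+1}, which should coincide.
    """
    rng = random.Random(seed)
    sums, counts = [0.0, 0.0], [0, 0]
    for _ in range(n):
        a = rng.randint(0, 1)                       # protected attribute
        h = rng.uniform(0.1, 1.0)                   # tier, independent of A
        y = 1 if rng.random() < h else 0            # label driven by tier only
        d = 1 if rng.random() < p_decision else 0   # DP decision: D independent of A
        factor = 1.0 + alpha_d * (2 * d - 1) + alpha_y * (2 * y - 1)
        h_next = min(h * factor, 1.0)
        sums[a] += h_next
        counts[a] += 1
    return sums[0] / counts[0], sums[1] / counts[1]
```

With a large sample the two group means of $H_{t+1}$ agree up to Monte Carlo noise, matching the proposition's conclusion that Tier Balancing is preserved.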

$(A_t, E_t)$ are root causes of all other variables $(H_t, X_{t,i}, Y^{(\mathrm{ori})}_t, D_t)$. Applying Definition B.2, we can utilize the functional causal model and represent each variable with a function (the structural equation) of its direct causes (including observed parents and unobserved exogenous terms). Then, we can iteratively replace variables with their corresponding structural equations and eventually represent the variables in $(H_t, X_{t,i}, Y^{(\mathrm{ori})}_t, D_t)$ as functions of $(A_t, E_t)$.

Firstly, in Appendix B.4.1, we characterize the conditional joint density of $\big(f_T(0, E_T), f_T(1, E_T)\big)$. Then, in Appendix B.4.2, we focus on the situation changes of each individual from $T = t$ to $T = t + 1$ induced by the specified dynamics. Finally, in Appendix B.4.3, we calculate the expectation in Equation B.2 by aggregating the situation changes for each individual from $T = t$ to $T = t + 1$.

The equality (ii) is derived from the Law of Iterated Expectation, keeping track of individual-level situation changes in the inner conditional expectation; the equality (iii) is the aggregation of individual-level situation changes by plugging in the conditional joint density $q_t\big(f_t(0, \epsilon), f_t(1, \epsilon) \mid d, d', y, y'\big)$.



ACKNOWLEDGEMENT

This project was partially supported by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF-Convergence Accelerator Track-D award #2134901, by a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research. YC and YL are partially supported by the National Science Foundation (NSF) under grants IIS-2143895 and IIS-2040800, and CCF-2023495. 


Proof (full). For the perfect predictor $D_t = Y^{(\mathrm{ori})}_t$, among all possible instantiations of $\varphi_{t+1}(e_{t+1})$ as listed in Table 4, Table 5, and Table 6, not every case corresponds to a nonzero $P_t(d, d', y, y')$, and therefore not every case contributes to the computation of $\Delta_{\mathrm{STIR}}|^{t+1}_t$ as detailed in Equation 5. By applying Equation 8, we only need to consider Case (i), Case (vi), Case (xi), and Case (xvi).

When $(d, d', y, y') = (0, 1, 0, 1)$, i.e., for Case (vi):

(vi.1.1) $|\varphi(e_{t+1})| - |\varphi(e_t)| = -(\alpha_D + \alpha_Y)\big( f_t(0, e_t) + f_t(1, e_t) \big) < 0$ if we have $f_t(0, e_t) \in (0, 1]$, $f_t(1, e_t) \in (0, \ldots$

where $\gamma^{(\mathrm{low})}_{0101}, \gamma^{(\mathrm{up})}_{1010} \in (0, 1)$ (according to Assumption 3.6), and $\alpha_D, \alpha_Y \in [0, \frac{1}{2})$ (according to Assumption 3.3). Now let us consider the data dynamics where $\alpha_Y = 0$ and simplify the form of $\Delta^{(\mathrm{Counterfactual\ Fair})}_{\mathrm{STIR}}|^{t+1}_t$. As we can see, as long as we have $\frac{P_t(1,1,0,1) + P_t(1,1,1,0)}{P_t(0,0,0,1) + P_t(0,0,1,0)} < \frac{27}{8}$ and at the same time the parameter satisfies $\alpha_D \in \Big( \big( \frac{P_t(1,1,0,1) + P_t(1,1,1,0)}{P_t(0,0,0,1) + P_t(0,0,1,0)} \big)^{1/3} - 1,\ \frac{1}{2} \Big)$, it is possible for the Counterfactual Fair predictor to achieve a negative value of $\Delta_{\mathrm{STIR}}|^{t+1}_t$ after a one-step intervention.

Table 1: When $\alpha_D > \alpha_Y$, compare cases of $H_{t+1}$ with different values of $A_t$.
Table 2: When $\alpha_D < \alpha_Y$, compare cases of $H_{t+1}$ with different values of $A_t$.
Table 3: When $\alpha_D = \alpha_Y = \alpha$, compare cases of $H_{t+1}$ with different values of $A_t$.
Table 4: When $\alpha_D > \alpha_Y$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.
Table 5: When $\alpha_D < \alpha_Y$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.
Table 6: When $\alpha_D = \alpha_Y = \alpha$, list possible instantiations of $\varphi_{t+1}(e_{t+1})$.

