CAUSAL REPRESENTATION LEARNING FOR INSTANTANEOUS AND TEMPORAL EFFECTS IN INTERACTIVE SYSTEMS

Published as a conference paper at ICLR 2023

Abstract

Causal representation learning is the task of identifying the underlying causal variables and their relations from high-dimensional observations, such as images. Recent work has shown that one can reconstruct the causal variables from temporal sequences of observations under the assumption that there are no instantaneous causal relations between them. In practical applications, however, our measurement or frame rate might be slower than many of the causal effects. This effectively creates "instantaneous" effects and invalidates previous identifiability results. To address this issue, we propose iCITRIS, a causal representation learning method that allows for instantaneous effects in intervened temporal sequences when intervention targets can be observed, e.g., as actions of an agent. iCITRIS identifies the potentially multidimensional causal variables from temporal observations, while simultaneously using a differentiable causal discovery method to learn their causal graph. In experiments on three datasets of interactive systems, iCITRIS accurately identifies the causal variables and their causal graph.

1. INTRODUCTION

Recently, there has been a growing interest in causal representation learning (Schölkopf et al., 2021), which aims at learning representations of causal variables in an underlying system from high-dimensional observations like images. Several works have considered identifying causal variables from time series data, assuming that the variables are independent of each other conditioned on the previous time step (Gresele et al., 2021; Khemakhem et al., 2020a; Lachapelle et al., 2022a; b; Lippe et al., 2022b; Yao et al., 2022a; b). This assumes that within each discrete, measured time step, intervening on one causal variable does not affect any other variable instantaneously. However, in real-world systems, this assumption is often violated, as there might be causal effects that act faster than the measurement or frame rate (Faes et al., 2010; Hyvärinen et al., 2008; Moneta et al., 2006; Nuzzi et al., 2021). Consider the example of a light switch and a light bulb. When flipping the switch, there is an almost immediate effect on the light by turning it on or off, changing the appearance of the whole room instantaneously. In this case, an intervention on a variable (e.g., the switch) also affects other variables (e.g., the bulb) in the same time step, violating the assumption that each variable is independent of the others in the same time step, conditioned on the previous time step. In biology, some protein-protein interactions also occur nearly instantaneously (Acuner Ozbabacan et al., 2011). To overcome this limitation, we consider the task of identifying causal variables and their causal graphs from temporal sequences, even in the case of instantaneous cause-effect relations. This task contains two main challenges: identifying the causal variables from observations, and learning the causal relations between those variables.
We show that, as opposed to temporal sequences without instantaneous effects, neither of these two tasks can be completed without the other: without knowing the variables, we cannot identify the graph; and without knowing the graph, we cannot identify the causal variables, since they are not conditionally independent. In particular, in contrast to causal relations across time steps, the orientations of instantaneous edges are not determined by the temporal ordering, hence requiring us to jointly solve the tasks of causal representation learning and causal discovery. As a starting point, we consider the setting of CITRIS (Causal Identifiability from Temporal Intervened Sequences; Lippe et al. (2022b)). In CITRIS, potentially multidimensional causal variables interact over time, and interventions with known targets may have been performed. While that work assumed all causal relations to be temporal, i.e., from variables in one time step to variables in the next, we generalize this setting to include instantaneous causal effects. In particular, we show that in general, causal variables are not identifiable without access to partially-perfect interventions, i.e., interventions that remove the instantaneous parents. If such interventions are available, we prove that we can identify the minimal causal variables (Lippe et al., 2022b), i.e., the parts of the causal variables that are affected by the interventions, together with their temporal and instantaneous causal graph. Our results generalize the identifiability results of Lippe et al. (2022b), since if there are no instantaneous causal relations, any intervention is partially-perfect by definition. As a practical implementation, we propose instantaneous CITRIS (iCITRIS).
iCITRIS maps high-dimensional observations, e.g., images, to a lower-dimensional latent space on which it learns an instantaneous causal graph by integrating differentiable causal discovery methods into its prior (Lippe et al., 2022a; Zheng et al., 2018). In experiments on three different video datasets, iCITRIS accurately identifies the causal variables as well as their instantaneous and temporal causal graph. Our contributions are:

• We show that causal variables in temporal sequences with instantaneous effects are not identifiable without interventions that remove the instantaneous parents.
• We prove that, when having access to such interventions with known targets, the minimal causal variables can be identified along with their causal graph under mild assumptions.
• We propose iCITRIS, a causal representation learning method that identifies the minimal causal variables and their causal graph even in the case of instantaneous causal effects.

Related Work

We provide an extended discussion of related work in Appendix C. Early works on causal representation learning focused on identifying independent factors of variation (Klindt et al., 2021; Kumar et al., 2018; Locatello et al., 2019; 2020b; Träuble et al., 2021), in settings similar to Independent Component Analysis (ICA) (Comon, 1994; Hyvärinen et al., 2001; 2019). In particular, Lachapelle et al. (2022a; b); Yao et al. (2022a; b) discuss the identifiability of causal variables from temporal sequences. Yet, in all of these ICA-based setups, the causal variables are required to be conditionally independent. For causally-dependent variables, Yang et al. (2021) learn causal variables from labeled images in a supervised manner. Ahuja et al. (2022); Brehmer et al. (2022) identify causal variables with unknown causal relations from pairs of observations that only differ in a subset of causal factors influenced by an intervention, i.e., from counterfactual observations. As discussed by Pearl (2009), however, knowing counterfactuals is not realistic in most scenarios. Instead, CITRIS (Lippe et al., 2022b) focuses on temporal sequences, in which even variables that are not intervened upon continue evolving over time. On the other hand, in this setting the intervention targets need to be known. Moreover, within a time step, the causal variables are assumed to be independent conditioned on the variables of the previous time step, hence not allowing for instantaneous effects. To the best of our knowledge, iCITRIS is the first method to identify causal variables and their causal graph from temporal, intervened sequences even for potentially instantaneous causal effects, without requiring counterfactuals or data labeled with the true causal variables.

2. RELEVANT BACKGROUND AND DEFINITIONS

In this work, we start from the setting of Temporal Intervened Sequences (TRIS) (Lippe et al., 2022b) . For clarity, we provide a brief overview of TRIS and discuss previous identifiability results, before extending and generalizing the theory to instantaneous effects.

2.1. TEMPORAL INTERVENED SEQUENCES

Temporal intervened sequences (TRIS) (Lippe et al., 2022b) model a latent temporal causal process S with K causal variables (C_1^t, C_2^t, ..., C_K^t)_{t=1}^T (e.g., the light switch and bulb), representing a dynamic Bayesian network (DBN) (Dean et al., 1989; Murphy, 2002). Each causal variable C_i is instantiated at each time step t, denoted by C_i^t, and its causal parents pa(C_i^t) are a subset of the variables at time t-1. To represent interventions, the causal graph is augmented with binary intervention variables I^t ∈ {0, 1}^K, where I_i^t = 1 indicates that the causal variable C_i^t has been intervened upon at time step t. This setting can model soft interventions (Eberhardt, 2007), i.e., interventions that change the conditional distribution: p(C_i^t | pa(C_i^t), I_i^t = 1) ≠ p(C_i^t | pa(C_i^t), I_i^t = 0) (e.g., flipping the light switch). This trivially includes perfect interventions, do(C_i = c_i) (Pearl, 2009), which make the target variable independent of its parents. To allow for arbitrary sets of interventions, the intervention variables are considered to be confounded by an unobserved variable R^t. While the intervention variables I^t are assumed to be observed, the actual values of the intervened variables, e.g., the state of the light switch, are not. The graph and its parameters are assumed to be time-invariant (i.e., they repeat across time steps), causally sufficient (i.e., no latent confounders besides the variables mentioned before), and faithful (i.e., no additional independences beyond those encoded in the graph). In this setting, causal variables can be scalar or span multiple dimensions, i.e., C_i ∈ D_i^{M_i} with dimensionality M_i ≥ 1 and domain D_i. For example, a 3d position can be represented as C_i ∈ R^3. The causal variable space is defined as C = D_1^{M_1} × D_2^{M_2} × ... × D_K^{M_K}.
Instead of observing the causal variables directly, we measure a high-dimensional observation X^t, e.g., an image, representing a noisy, entangled view of all causal variables C^t = (C_1^t, C_2^t, ..., C_K^t) at time step t. The observation function is defined as h(C_1^t, C_2^t, ..., C_K^t, E^t) = X^t, where E^t ∈ E is any noise on the observation X^t that is independent of C^t (e.g., pixel or color shifts), and h : C × E → X is a function from the causal variable space C and the space of the observation noise E = R^L to the observation space X ⊆ R^N. To allow unique identification of causal variables from observations, the observation function h is assumed to be bijective, i.e., there exists a unique inverse of h for every X ∈ X.
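As a toy numeric sketch of this generative process: a small temporal graph, a transition function, observed intervention targets, and an entangled observation. The graph `A`, the transition coefficients, and the mixing `h` below are all invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 100  # number of causal variables and time steps (toy sizes)

# Hypothetical temporal graph: A[j, i] = 1 means C_j^{t-1} -> C_i^t.
A = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]])

def step(c_prev, intervention):
    """One transition of the toy temporal process (no instantaneous edges)."""
    c_next = np.empty(K)
    for i in range(K):
        parents = c_prev[A[:, i] == 1]
        if intervention[i]:
            c_next[i] = rng.normal(2.0, 0.1)  # soft intervention: new conditional
        else:
            c_next[i] = 0.5 * parents.sum() + rng.normal(0.0, 0.1)
    return c_next

def h(c, e):
    """Toy observation function: fixed invertible mixing of c, plus a noise channel e."""
    W = np.array([[1.0, 0.2, 0.1], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
    return np.concatenate([W @ c, e])

c = np.zeros(K)
for t in range(T):
    I_t = rng.integers(0, 2, size=K)   # observed binary intervention targets
    c = step(c, I_t)
    x = h(c, rng.normal(size=2))       # entangled observation X^t
assert x.shape == (5,)
```

Only `x` and `I_t` would be observed by a learner; `c` and the graph `A` are latent.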

2.2. MINIMAL CAUSAL VARIABLES AND IDENTIFIABILITY CLASS

Multidimensional causal variables are not always fully identifiable in TRIS when interventions affect only a subset of a variable's dimensions (Lippe et al., 2022b), e.g., only x in a 3d position [x, y, z]. To account for such interventions, each causal variable C_i^t can be split into an intervention-dependent part s_i^var(C_i^t), e.g., [x], and an intervention-independent part s_i^inv(C_i^t), e.g., [y, z], where s_i(C_i^t) = (s_i^var(C_i^t), s_i^inv(C_i^t)) is an invertible function. Under this split, the distribution of C_i^t becomes:

p(s_i(C_i^t) | pa(C_i^t), I_i^t) = p(s_i^var(C_i^t) | pa(C_i^t), I_i^t) · p(s_i^inv(C_i^t) | pa(C_i^t)).    (1)

With this setup, Lippe et al. (2022b) define a minimal causal variable as follows:

Definition 2.1. The minimal causal variable of a causal variable C_i^t w.r.t. its intervention variable I_i^t is the intervention-dependent part s_i^var(C_i^t) of the split s_i(C_i^t) = (s_i^var(C_i^t), s_i^inv(C_i^t)) that maximizes the information content H(s_i^inv(C_i^t) | pa(C_i^t)) in terms of the limiting density of discrete points (LDDP) (Jaynes, 1957; 1968).

To identify the minimal causal variables from data triplets {x^t, x^{t+1}, I^{t+1}}, CITRIS approximates the observation function h by learning an invertible map g_θ : X → Z, with Z ∈ R^M being an M-dimensional latent space. For this latent space, CITRIS learns an assignment function ψ : [1..M] → [0..K], mapping each dimension of Z to one of the causal variables C_1, ..., C_K. The index 0 is used for the observation noise or intervention-independent variables. We denote the set of latents assigned to the causal variable C_i by z_{ψ_i} = {z_j | j ∈ [1..M], ψ(j) = i}. On this latent space, CITRIS models a prior p_{ϕ,ψ}(z^{t+1} | z^t, I^{t+1}) in which each group of latent variables, z^{t+1}_{ψ_i}, is conditioned on the previous time step z^t and its intervention variable I^{t+1}_i, but the groups are independent of each other within the same time step.
The causal graph can be found by pruning the temporal dependencies in this prior. Under this model, Lippe et al. (2022b) consider a causal system S to be identified by a model M if its minimal causal variables are identified up to an invertible transformation, and M recovers the true causal graph of S. CITRIS is shown to identify S if it maximizes the information content of z_{ψ_0}, under the constraint of maximizing the likelihood of the data points {x^t, x^{t+1}, I^{t+1}}, and no intervention variable I_i^t is a deterministic function of any other intervention variable. However, if causal effects occur faster than the observation rate, an intervention influences not only its target but also other variables in the same time step, leading CITRIS to potentially identify incorrect variables. In this paper, we generalize this identifiability result to systems where instantaneous causal relations may exist.
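A minimal sketch of the assignment function ψ and the resulting latent groups; the particular `psi` below is hypothetical and would be learned in practice.

```python
import numpy as np

M, K = 6, 2
# Hypothetical assignment psi: each latent dimension is mapped to one causal
# variable (1..K) or to 0 (observation noise / intervention-independent parts).
psi = np.array([1, 1, 2, 0, 2, 0])

def latent_groups(z, psi, K):
    """Split a latent vector z into the groups z_{psi_0}, ..., z_{psi_K}."""
    return {i: z[psi == i] for i in range(K + 1)}

z = np.arange(6.0)
groups = latent_groups(z, psi, K)
assert np.allclose(groups[1], [0.0, 1.0])  # dims 0, 1 model causal variable C_1
assert np.allclose(groups[2], [2.0, 4.0])  # dims 2, 4 model causal variable C_2
```

The prior then factorizes over these groups, each conditioned on the full previous latent vector and its own intervention variable.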

3. IDENTIFYING CAUSAL VARIABLES WITH INSTANTANEOUS EFFECTS

In this section, we generalize Temporal Intervened Sequences (TRIS) to settings where causal relations can be potentially instantaneous. First, we discuss the challenges arising from instantaneous effects, and then present solutions to overcome these challenges. Finally, we derive our identifiability results.

3.1. ITRIS AND CHALLENGES OF INSTANTANEOUS EFFECTS

We extend TRIS to a setting we call instantaneous Temporal Intervened Sequences (iTRIS), which allows for instantaneous causal effects. In iTRIS, causal variables within the same time step can cause each other, as long as the graph remains acyclic. This means that, for example, C_i^{t+1} can cause C_j^{t+1} for i ≠ j, as long as there is no directed path C_i^{t+1} → ... → C_i^{t+1}. Figure 1 summarizes this setting. While the addition of instantaneous effects may seem like a small change, it violates the key assumption of most previous works (Khemakhem et al., 2020a; Lachapelle et al., 2022a; b; Lippe et al., 2022b; Yao et al., 2022a; b), namely that causal variables within a time step are independent conditioned on some external variable. As a consequence, we have to differentiate between causal models in a much larger function space than before, making identifiability a considerably harder task.

[Figure 1: The iTRIS setting. All causal variables C^{t+1} and the observation noise E^{t+1} cause the observation h(C^{t+1}, E^{t+1}) = X^{t+1}; R^{t+1} is a latent confounder allowing for dependencies between the intervention variables I_1^{t+1}, ..., I_K^{t+1}.]

To formalize this intuition, consider the following example. Assume we have two latent causal variables C_1 and C_2 and, for simplicity, no temporal relations. The causal variables C_1 and C_2 do not cause each other, and we have an arbitrary observation function h(C_1, C_2) = X and distributions p_1(C_1), p_2(C_2). In this example, if our method allows for instantaneous effects, we cannot identify the causal variables or their graph from p_x(X) alone, since multiple representations model p_x(X) equally well. For instance, the representation Ĉ_1 = C_1, Ĉ_2 = C_1 + C_2 with the causal graph Ĉ_1 → Ĉ_2 models the same observational distribution, since p̂_2(Ĉ_2 | Ĉ_1) = p̂_2(C_1 + C_2 | C_1) = p_2(C_2), and hence p̂(Ĉ_2 | Ĉ_1) p̂(Ĉ_1) = p_1(C_1) p_2(C_2), which induces the same p_x(X). This happens even under soft interventions, because the causal graph can remain unchanged. However, if we have interventions on C_i that remove its instantaneous parents, then the learned representation of C_i must be independent of any instantaneous effects. This eliminates (Ĉ_1, Ĉ_2), since Ĉ_2 is not independent of Ĉ_1 given I_2 = 1. We refer to these interventions as partially-perfect:

Definition 3.1. A partially-perfect intervention on a causal variable C_i is a soft intervention that removes all parents in the instantaneous graph: p(C_i^t | pa(C_i^t), I_i^t = 1) = p_I(C_i^t | pa_temp(C_i^t)), where pa_temp(C_i^t) = {C_j^{t-1} | j ∈ [1..K], C_j^{t-1} ∈ pa(C_i^t)} and p_I is the post-interventional distribution.

As an example, consider an intervention that sets C_i^t = C_i^{t-1} + ϵ, where ϵ is noise. While this intervention breaks the instantaneous relations, the target still depends on the previous time step, making it partially-perfect. Using partially-perfect interventions, we can prove the following:

Lemma 3.2. In iTRIS, a causal variable C_i cannot be identified up to an invertible transformation T_i that is independent of all other causal variables if C_i can have instantaneous parents and no partially-perfect interventions on C_i are provided.

We provide the proof for this lemma, and an example with temporal relations, in Appendix D.2.1. In the following, we assume that all interventions in iTRIS are partially-perfect. This subsumes the soft-intervention setup in TRIS by definition, since the instantaneous graph is empty in that case.
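The effect of a partially-perfect intervention can be illustrated with a toy two-variable SCM (all coefficients below are invented for illustration): cutting the instantaneous edge C_1 → C_2 makes the two variables uncorrelated under the intervention, while C_2 still depends on its own past.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(intervene_c2, n):
    """Toy SCM with instantaneous edge C1 -> C2 and temporal self-dependence."""
    c1_prev = rng.normal(size=n)
    c2_prev = rng.normal(size=n)
    c1 = 0.8 * c1_prev + rng.normal(0, 0.1, n)
    if intervene_c2:
        # Partially-perfect intervention: the instantaneous parent C1 is cut,
        # but C2 still depends on its temporal parent C2^{t-1}.
        c2 = 0.8 * c2_prev + rng.normal(0, 0.1, n)
    else:
        c2 = 0.8 * c2_prev + 1.5 * c1 + rng.normal(0, 0.1, n)
    return c1, c2

c1_obs, c2_obs = sample_pair(False, 5000)
c1_int, c2_int = sample_pair(True, 5000)
corr = lambda a, b: np.corrcoef(a, b)[0, 1]
assert corr(c1_obs, c2_obs) > 0.5        # instantaneous effect clearly visible
assert abs(corr(c1_int, c2_int)) < 0.1   # cut under the partially-perfect intervention
```

It is exactly this induced independence that rules out entangled representations such as Ĉ_2 = C_1 + C_2 in the example above.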

3.2. IDENTIFYING THE MINIMAL CAUSAL VARIABLES IN ITRIS

Using partially-perfect interventions, we introduce our identifiability results in iTRIS. For simplicity of the theoretical analysis, we follow Lippe et al. (2022b) and consider a continuous domain for the causal variables, i.e., D = R, and distributions with full support. In Section 5, we empirically extend the results to variables with categorical and circular domains, and to distributions with limited support. Further, we assume that the temporal dependencies and interventions break all symmetries in the variables' distributions, such that distributional independence implies functional independence. For a Gaussian, this entails that the mean cannot be constant (see Appendix D.2 for details on the assumptions). Similar to CITRIS, we learn an invertible mapping g_θ : X → Z and an assignment function ψ : [1..M] → [0..K] to align latent and causal variables. Differently, however, we also learn a directed acyclic graph G on the K+1 latent variable groups z_{ψ_0}, ..., z_{ψ_K} to model the instantaneous causal relations. The graph G induces a parent structure denoted by z_{ψ_{pa_i}} = {z_j | j ∈ [1..M], ψ(j) ∈ pa_G(i)}, where pa_G(0) = ∅, i.e., the variables in z_{ψ_0} have no instantaneous parents. Meanwhile, the temporal causal graph between C^t and C^{t+1} is implicitly learned by conditioning z^{t+1} on all latents of the previous time step, z^t, and can be pruned after training. This results in the following prior:

p_{ϕ,ψ,G}(z^{t+1} | z^t, I^{t+1}) = p_ϕ(z^{t+1}_{ψ_0} | z^t) · ∏_{i=1}^K p_ϕ(z^{t+1}_{ψ_i} | z^t, z^{t+1}_{ψ_{pa_i}}, I^{t+1}_i).    (2)

With this setup, we can now formally define our identifiability class, which contains the one of CITRIS:

Definition 3.3. A model M = ⟨θ, ϕ, ψ, G⟩ identifies a causal system S = ⟨C, E, h⟩ iff for each causal variable C_i, i ∈ [1..K], the following two conditions hold: (1) each minimal causal variable is identified up to an invertible transformation T_i, i.e., for all observations x ∈ X: s_i^var(c_i) = T_i(z_{ψ_i}), where c_i = [h^{-1}(x)]_i is the value of the true causal variable and z_{ψ_i} = [g_θ(x)]_{ψ_i} is the value of the estimated minimal causal variable; and (2) the estimated parents pa(z^t_{ψ_i}) are the same as the true parents of the minimal causal variable s_i^var(C_i^t), i.e., the estimated parent set contains z^τ_{ψ_j}, j ∈ [1..K], τ ∈ {t-1, t} iff s_j^var(C_j^τ) is a parent of s_i^var(C_i^t), and it contains z^τ_{ψ_0} iff there exists l ∈ [1..K] for which s_l^inv(C_l^τ) is a parent of s_i^var(C_i^t).

For the light switch example, this means that the latent variable z_{ψ_1} must model the switch's state, z_{ψ_2} the bulb's state, and the instantaneous graph is z_{ψ_1} → z_{ψ_2}. Compared to ICA-based results, this identifiability class explicitly aligns the latent variables with the causal variables; thus, we do not rely on a permutation equivalence class. To identify S from observations, we consider a dataset of triplets {x^t, x^{t+1}, I^{t+1}} with observations x^t, x^{t+1} ∈ X and intervention variables I^{t+1}. This dataset could be created interactively or, for example, recorded by an expert. With this, the objective becomes:

p_{ϕ,ψ,θ,G}(x^{t+1} | x^t, I^{t+1}) = |det J_{g_θ}(x^{t+1})| · p_{ϕ,ψ,G}(z^{t+1} | z^t, I^{t+1}),    (3)

where the Jacobian of g_θ, J_{g_θ}(x^{t+1}), is introduced by the change of variables from x to z. If dim(X) > dim(C × E), we consider g_θ to contain an arbitrary, fixed map from X to the lower dimensionality. Under the assumption that g_θ and p_ϕ are universal function approximators and the dataset is unlimited in size, we derive the following identifiability result:

Theorem 3.4. In iTRIS, a model M* = ⟨θ*, ϕ*, ψ*, G*⟩ identifies a causal system S = ⟨C, E, h⟩ (Definition 3.3) if M*, under the constraint of maximizing the likelihood p_{ϕ,θ,G}(X^{t+1} | X^t, I^{t+1}): (1) maximizes the information content H(z^{t+1}_{ψ_0} | z^t) in terms of the LDDP (Jaynes, 1957; 1968), (2) minimizes the number of edges in G*, and (3) no two intervention variables I_i^t, I_j^t are deterministically related, i.e., ∀j ≠ i : ¬(∃f, ∀t : I_i^t = f(I_j^t)).

Intuitively, this theorem shows that we can identify the minimal causal variables under the same constraints as CITRIS, even when instantaneous effects are present. The proof in Appendix D follows three main steps. First, we show that the true observation function constitutes a global optimum of Equation (3), but is not necessarily unique. Second, we derive that any global optimum must identify the minimal causal variables. Finally, we show that optimizing the data likelihood identifies the complete causal graph, i.e., instantaneous and temporal, between the minimal causal variables.
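A minimal numeric sketch of a structured prior of this form, assuming one scalar latent per group and linear-Gaussian conditionals (both simplifications; the weight matrices `W_temp`, `W_inst` are hypothetical): each group is conditioned on the previous step, and its instantaneous parents are selected by the learned graph `G` and dropped under a partially-perfect intervention.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_prior(z_next, z_prev, I_next, G, W_temp, W_inst):
    """log p(z^{t+1} | z^t, I^{t+1}) = log p(z_0 | z^t)
         + sum_i log p(z_i | z^t, instantaneous parents under G, I_i).
    z_next, z_prev: vectors of length K+1 (group 0 plus K causal groups).
    G: K x K instantaneous adjacency, G[j, i] = 1 means group j+1 -> group i+1."""
    K = len(z_next) - 1
    logp = gaussian_logpdf(z_next[0], W_temp[0] @ z_prev)  # group 0: no inst. parents
    for i in range(1, K + 1):
        mean = W_temp[i] @ z_prev                  # temporal dependence (always kept)
        if not I_next[i - 1]:
            # Instantaneous parents, masked by the learned graph G; a
            # partially-perfect intervention removes exactly this term.
            mean += (W_inst[i - 1] * G[:, i - 1]) @ z_next[1:]
        logp += gaussian_logpdf(z_next[i], mean)
    return logp
```

In iCITRIS the conditionals are neural networks rather than linear maps, and `G` is sampled from a learned edge distribution, but the factorization is the same.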

4. CAUSAL REPRESENTATION LEARNING WITH INSTANTANEOUS EFFECTS

Based on our theoretical results, we propose iCITRIS, a generalization of CITRIS (Lippe et al., 2022b) . We first review the original CITRIS architecture, and then describe our extensions in iCITRIS.

4.1. BASELINE: CITRIS

CITRIS is built on a variational autoencoder (VAE) (Kingma et al., 2014), where the (convolutional) encoder and decoder approximate the invertible map g_θ. To promote the identification of the causal variables in latent space, the prior of the VAE follows Equation (2), excluding the instantaneous parents. All latent distributions p_ϕ are usually implemented as conditional Gaussians, with the mean and standard deviation predicted by a small MLP per latent variable. Finally, the VAE is trained via maximum likelihood on p(X^{t+1} | X^t, I^{t+1}). Alternatively, CITRIS can also be trained on the representations of a pretrained autoencoder, where the map g_θ is replaced by a normalizing flow (Rezende et al., 2015). We follow the same setup in iCITRIS, but extend CITRIS' prior to instantaneous effects.
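A rough sketch of such a per-latent conditional Gaussian prior; the MLP size, weight shapes, and input layout below are invented (real implementations use learned deep-learning modules), but the idea is the same: a small network maps the conditioning set to a mean and log-standard-deviation.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Tiny two-layer MLP with a tanh hidden layer."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def prior_params(z_prev, intervened, weights):
    """Hypothetical per-latent conditional prior: an MLP maps the previous
    latents plus the intervention flag of the assigned causal variable to
    the (mean, std) of a Gaussian over the current latent."""
    inp = np.concatenate([z_prev, [float(intervened)]])
    out = mlp(inp, *weights)
    return out[0], np.exp(out[1])  # exp ensures a positive std

rng = np.random.default_rng(0)
d = 4  # toy latent dimensionality of the previous time step
weights = (rng.normal(size=(8, d + 1)), rng.normal(size=8),
           rng.normal(size=(2, 8)), rng.normal(size=2))
mean, std = prior_params(rng.normal(size=d), True, weights)
assert std > 0
```

In iCITRIS, the same network additionally receives the (graph-masked) instantaneous parents as input.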

4.2. LEARNING THE INSTANTANEOUS CAUSAL GRAPH

To learn the instantaneous causal graph simultaneously with the causal representation, we incorporate recent differentiable score-based causal discovery methods into iCITRIS. Given a distribution over graphs p(G), the conditional distribution over the latent variables z^{t+1} of Equation (2) becomes:

p_{ϕ,ψ,G}(z^{t+1} | z^t, I^{t+1}) = p_ϕ(z^{t+1}_{ψ_0} | z^t) · E_{G∼p(G)}[ ∏_{i=1}^K p_ϕ(z^{t+1}_{ψ_i} | z^t, z^{t+1}_{ψ_{pa_i}}, I^{t+1}_i) ]    (4)

where the parent sets z_{ψ_{pa_i}} depend on the graph structure G. The goal is to jointly optimize p_ϕ and p(G) under the maximum likelihood objective of Equation (3), such that p(G) is peaked at the correct causal graph. To this end, we experiment with two causal discovery methods that allow for continuous optimization: NOTEARS (Zheng et al., 2018) and ENCO (Lippe et al., 2022a).

NOTEARS (Zheng et al., 2018) casts structure learning as a continuous optimization problem by providing a continuous constraint on the adjacency matrix that enforces acyclicity. Following Ng et al. (2022), we model the adjacency matrix with independent edge likelihoods and differentiably sample from it using the Gumbel-Softmax trick (Jang et al., 2017). We use these samples as graphs in the prior p_ϕ(z^{t+1} | z^t, I^{t+1}) to mask the parents of the individual causal variables, and obtain gradients for the graph through the maximum likelihood objective of the prior. To promote acyclicity, we use the constraint as a regularizer and exponentially increase its weight over training.

ENCO (Lippe et al., 2022a), on the other hand, uses interventional data and two separate parameter sets: one for the orientation of each edge and one for its existence. By only using interventions to update the orientation parameters, ENCO converges to the true, acyclic graph under single-target interventions in the sample limit. Yet, we found it to also work well under the multi-target interventions in iCITRIS. ENCO uses low-variance, unbiased gradients based on REINFORCE (Williams, 1992), potentially providing a more stable optimization than NOTEARS. For efficiency, we merge the two learning stages of ENCO and update both the graph and distribution parameters in each iteration.
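The NOTEARS acyclicity constraint h(A) = tr(exp(A ⊙ A)) − d, together with Gumbel-Softmax-style edge sampling, can be sketched as follows. The truncated matrix-exponential series is our simplification of the exact constraint; the binary-concrete relaxation follows the standard Gumbel-Softmax formulation for independent edges.

```python
import numpy as np

def notears_acyclicity(A, terms=20):
    """NOTEARS penalty h(A) = tr(exp(A * A)) - d, computed via a truncated
    power series of the matrix exponential. h(A) = 0 iff A is acyclic."""
    d = A.shape[0]
    M = A * A
    E, P = np.eye(d), np.eye(d)
    for k in range(1, terms):
        P = P @ M / k        # P accumulates M^k / k!
        E = E + P
    return np.trace(E) - d

def gumbel_softmax_edges(logits, tau=1.0, rng=None):
    """Relaxed Bernoulli edge samples from independent edge logits
    (binary concrete), usable to softly mask parents in the prior."""
    rng = rng or np.random.default_rng()
    g1 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    g0 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g0) / tau))

A_dag = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])  # chain
A_cyc = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])  # 3-cycle
assert abs(notears_acyclicity(A_dag)) < 1e-8
assert notears_acyclicity(A_cyc) > 0.1
```

During training, `notears_acyclicity` of the edge-probability matrix is added to the negative log-likelihood as a regularizer with an increasing weight.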

4.3. STABILIZING THE OPTIMIZATION PROCESS

Simultaneously identifying the causal variables and their graph leads to a chicken-and-egg situation: without knowing the variables, we cannot identify the graph; but without knowing the graph, we cannot identify the causal variables. This can cause the optimization to be unstable and to converge to local minima with incorrect graphs. To stabilize it, we propose the following two approaches.

Graph learning scheduler. During the first training iterations, the assignment of latent to causal variables is almost random, so the gradients for the graph parameters are very noisy and uninformative. Thus, we use a learning rate schedule for the graph parameters that freezes them for the first few epochs. During these iterations, the model learns to fit the latent variables to the intervention variables under an arbitrary graph, leading to an initial, rough assignment of latent to causal variables. Then, we warm up the learning rate to slowly start the graph learning process while continuing to separate the causal variables in latent space.

Mutual information estimator. If the provided interventions are fully perfect, i.e., they remove the temporal dependencies as well, we can exploit this independence by masking out the temporal parents in the prior distribution under interventions. Furthermore, with perfect interventions, we also enforce the mutual information (MI) (Kullback, 1997) between parents and children under interventions to be zero as an additional regularizer. Following work on neural MI estimation (Belghazi et al., 2018; Hjelm et al., 2019; van den Oord et al., 2018), we train a network to distinguish between samples from the joint distribution p(z^t_{ψ_i}, z^t_{ψ_{pa_i}}, z^{t-1} | I_i^t = 1) and the product of the marginals p(z^t_{ψ_i} | I_i^t = 1) · p(z^t_{ψ_{pa_i}}, z^{t-1} | I_i^t = 1). While the MI estimator optimizes its classification accuracy, the latents are optimized to do the opposite, effectively forcing z^t_{ψ_i} and its parents to be independent under interventions.
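The discriminator objective behind such an MI estimator can be sketched as a plain logistic loss on critic scores (the critic network itself and the adversarial training loop are omitted; this is only the loss shape, not the paper's exact implementation):

```python
import numpy as np

def mi_discriminator_loss(joint_scores, marginal_scores):
    """Binary cross-entropy for a critic separating samples of the joint
    p(z_i, pa(z_i) | I_i = 1) (label 1) from the product of marginals
    (label 0). The latents are trained adversarially to make the two
    distributions indistinguishable, driving the MI toward zero."""
    eps = 1e-8
    p_joint = 1.0 / (1.0 + np.exp(-joint_scores))      # sigmoid of critic scores
    p_marg = 1.0 / (1.0 + np.exp(-marginal_scores))
    return -(np.mean(np.log(p_joint + eps)) + np.mean(np.log(1.0 - p_marg + eps)))

# A confident, correct critic yields a much lower loss than a random one.
confident = mi_discriminator_loss(np.full(10, 5.0), np.full(10, -5.0))
random_guess = mi_discriminator_loss(np.zeros(10), np.zeros(10))
assert confident < random_guess
```

The latents receive the negated gradient of this loss under interventions, while the critic receives the ordinary gradient.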

5. EXPERIMENTS

We evaluate iCITRIS on three video datasets with varying difficulties and compare it to common causal representation methods. We include further dataset details in Appendix E, discuss hyperparameters in Appendix F, and provide the code at https://github.com/phlippe/CITRIS.

5.1. EXPERIMENTAL SETTINGS

Baselines. Since iCITRIS is, to the best of our knowledge, the first method to identify causal variables with instantaneous effects in this setting, we compare it to methods for identifying conditionally independent causal variables. First, we use CITRIS (Lippe et al., 2022b) and the Identifiable VAE (iVAE) (Khemakhem et al., 2020a), which both use the previous time step and the intervention targets to model conditionally independent variables. Further, to compare to a model with dependencies among latent variables, we evaluate the iVAE with an autoregressive prior, which we denote by iVAE-AR. All methods share the general model setup, e.g., the encoder network architecture, where possible.

Evaluation metrics. To evaluate the identification of the causal variables, we follow Lippe et al. (2022b) and report R² correlation scores. In particular, R²_{ij} is the score between the true causal variable C_i and the latents assigned to causal variable C_j by the learned model, i.e., z_{ψ_j}. We denote the average correlation of each predicted variable with its true value by R²_diag = (1/K) Σ_i R²_{ii} (optimal: 1), and the average maximum correlation with any other variable by R²_sep = (1/K) Σ_i max_{j≠i} R²_{ij} (optimal: 0). Furthermore, to investigate the modeling of the temporal and instantaneous relations between the causal variables, we perform causal discovery as a post-processing step on the latent representations, since the baselines do not explicitly learn the graph, and report the Structural Hamming Distance (SHD) between the predicted and the true causal graph.
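The two scores can be computed from an R² matrix as follows. For this sketch we use simple per-pair linear regression and one scalar latent per variable, a simplification of the actual evaluation protocol, which regresses from the full latent group z_{ψ_j}.

```python
import numpy as np

def r2_matrix(C_true, C_pred):
    """R2[i, j]: how well predicted variable j explains true variable i,
    using linear regression with an intercept (a simplifying assumption)."""
    n, K = C_true.shape
    R2 = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            X = np.stack([C_pred[:, j], np.ones(n)], axis=1)
            coef, *_ = np.linalg.lstsq(X, C_true[:, i], rcond=None)
            resid = C_true[:, i] - X @ coef
            R2[i, j] = 1.0 - resid.var() / C_true[:, i].var()
    return R2

def diag_sep_scores(R2):
    """R2_diag (optimal 1) and R2_sep (optimal 0) from the R2 matrix."""
    K = R2.shape[0]
    r2_diag = np.mean(np.diag(R2))
    r2_sep = np.mean([max(R2[i, j] for j in range(K) if j != i) for i in range(K)])
    return r2_diag, r2_sep
```

For a perfectly disentangled model (each latent an invertible function of exactly one independent variable), `r2_diag` approaches 1 and `r2_sep` approaches 0.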

5.2. 2D COLORED CELLS WITH CAUSAL EFFECTS: VORONOI BENCHMARK

We first conduct experiments on synthetically generated causal graphs with various instantaneous structures to investigate the difficulty and challenges of the task. We consider three instantaneous graph structures: random has a randomly sampled, acyclic graph structure with a probability of 0.5 of two variables being connected by a direct edge, while chain and full represent the minimally- and maximally-connected DAGs, respectively. For each graph, we sample temporal edges with an edge probability of 0.25, matching the density of the instantaneous causal graph. Based on these graphs, we create the variables' observational distributions as Gaussians parameterized by randomly initialized neural networks, and for simplicity provide single-target, perfect interventions for all variables. The causal variables are mapped to the image space X by first applying a randomly initialized two-layer normalizing flow, and then plotting them as colors in a 32×32-pixel image of a fixed Voronoi diagram as an irregular structure. Thus, the representation learning models need to undo the entanglement of the random normalizing flow and exploit the underlying causal graph to identify the causal variables, while also performing causal discovery to find the correct graph.

In Figure 2, we show the results of all models on graphs of 4, 6, and 9 variables. For the random and chain graphs, iCITRIS-ENCO identifies the causal variables and their causal graph with only minor errors, even for the largest graphs of 9 variables. Even on the challenging full graph, iCITRIS-ENCO considerably outperforms the other models. In contrast, iCITRIS-NOTEARS struggles with the edge orientations and converges to edge probabilities noticeably lower than 1.0, with which the variables cannot be perfectly identified anymore, especially for increasing graph sizes. Meanwhile, CITRIS and iVAE find K independent dimensions, conditioned on the previous time step and the intervention targets, instead of the true causal variables, which leads to sparse instantaneous but wrongly dense temporal graphs. Finally, the autoregressive baseline, iVAE-AR, naturally entangles all dimensions in the latent space, from which the true causal graph cannot be recovered anymore. This underlines the non-triviality of identifying instantaneously related causal variables. In conclusion, iCITRIS identifies the causal variables and graph well across graph structures and sizes, with ENCO outperforming NOTEARS due to more stable optimization, especially for larger, complex graphs.
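Sampling a random acyclic instantaneous graph with edge probability 0.5, as in the random setting, can be sketched as: draw a random variable ordering and only allow edges that respect it, which guarantees acyclicity by construction.

```python
import numpy as np

def sample_random_dag(K, p=0.5, rng=None):
    """Sample a random DAG over K variables: edges only go forward in a
    random topological ordering, each included with probability p."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(K)
    A = np.zeros((K, K), dtype=int)
    for a in range(K):
        for b in range(a + 1, K):
            if rng.uniform() < p:
                A[order[a], order[b]] = 1  # A[i, j] = 1 means C_i -> C_j
    return A

A = sample_random_dag(6, p=0.5, rng=np.random.default_rng(0))
# Acyclic by construction: no directed path returns to its start,
# so every power of A has a zero diagonal.
assert all(np.trace(np.linalg.matrix_power(A, k)) == 0 for k in range(1, 7))
```

The chain and full settings correspond to keeping exactly the edges between consecutive variables in the ordering, or all forward edges, respectively.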

5.3. 3D OBJECT RENDERINGS: INSTANTANEOUS TEMPORAL CAUSAL3DIDENT

As a visually challenging dataset, we use the Temporal Causal3DIdent dataset (Lippe et al., 2022b; von Kügelgen et al., 2021), which we extend with instantaneous causal relations among its causal variables. For instance, a change in the rotation leads to an instantaneous change in the position of the object, which in turn influences the spotlight. Overall, we obtain an instantaneous graph of eight edges between the seven multidimensional causal variables. We provide partially-perfect interventions that remove instantaneous parents but leave the existing temporal dependencies unchanged. Since the dataset is visually complex, we use the normalizing flow variant of iCITRIS and CITRIS applied on a pretrained autoencoder. Table 1 shows that iCITRIS-ENCO identifies the causal variables well and recovers most instantaneous relations, with up to two errors on average. The temporal graph has more false-positive edges due to minor correlations. iCITRIS-NOTEARS incorrectly orients several edges during training, underlining the benefit of ENCO as the graph learning method in iCITRIS. The baselines show a significantly higher entanglement of the causal variables and struggle with finding the true causal graph. Further, in Appendix G.2, we apply iCITRIS to the original Temporal Causal3DIdent dataset, which contains only temporal causal relations and no instantaneous effects. In this setting, iCITRIS performs on par with CITRIS, verifying that iCITRIS generalizes CITRIS across datasets. In summary, iCITRIS-ENCO can identify the causal variables along with their instantaneous graph well, even in a visually challenging dataset.

5.4. CAUSAL PINBALL

Finally, we consider a simplified version of the game Pinball, which naturally has instantaneous causal effects: if the paddles are activated when the ball is close, the ball is accelerated immediately. Similarly, when the ball hits a bumper, its light turns on and the score increases immediately. This results in instantaneous effects, especially under common frame rates.
In this environment, we consider five causal variables: the position of the left paddle, the position of the right paddle, the ball (position and velocity), the state of the bumpers, and the score. Interventions again remove instantaneous parents but keep temporal ones. Pinball is closer to a real-world environment than the other two datasets and has two characteristic differences: (1) many aspects of the environment are deterministic, e.g., the ball movement, and (2) the instantaneous effects are sparse, e.g., the paddles do not influence the ball if it is far away from them. Such a setting violates several of our assumptions, including faithfulness, full support, and the absence of symmetries in the observational and interventional distributions, raising the question of whether iCITRIS works here empirically. The results in Table 2 suggest that iCITRIS still performs well in this environment. Besides identifying the causal variables well, iCITRIS-ENCO recovers the instantaneous causal graph with minor errors. In contrast, CITRIS entangles the variables much more strongly, while iVAE has difficulties identifying all variables in the environment. This shows that iCITRIS can be applied in challenging environments beyond our theoretical assumptions, even with deterministic causal effects, while maintaining strong empirical results.

6. CONCLUSION AND DISCUSSION

We propose iCITRIS, a causal representation learning framework for temporal, intervened sequences with potentially instantaneous effects. From such sequences, iCITRIS identifies the minimal causal variables while jointly learning the instantaneous and temporal causal graph. In experiments, iCITRIS accurately recovers the causal variables and their graph in three video datasets. Since instantaneous effects are common in real-world settings (Hyvärinen et al., 2008; Nuzzi et al., 2021), we believe that iCITRIS contributes an important step towards practical causal representation learning methods. Still, as with most other theoretical results, our identifiability theorem is limited by the assumptions it makes. The two most crucial assumptions in iCITRIS are having a dataset, potentially recorded by an expert, that has (1) known intervention targets that are not deterministic functions of one another, and (2) partially-perfect interventions, i.e., interventions that remove instantaneous parents. Without the first assumption, causal variables may become entangled in the latent space, and without the second, instantaneous causal relations may be predicted where none truly exist. However, as demonstrated in the experiments on Causal3DIdent and Causal Pinball, iCITRIS still achieves strong empirical performance in settings that violate other assumptions. For instance, in these experiments, the distributions had limited support and some variables had circular or categorical domains. To extend iCITRIS to even more settings, future work includes investigating a setup where interventions are not directly available but can be performed by sequences of actions, and targets must be learned in an unsupervised manner. Further, iCITRIS is limited to acyclic graphs, while for instantaneous effects, cycles could occur under low frame rates; extending to this case is another interesting future direction.

A BROADER IMPACT

The importance of causal reasoning for machine learning applications, especially reinforcement learning and latent dynamics understanding, has been emphasized by several previous works (De Haan et al., 2019; Lachapelle et al., 2022b; Pearl, 2009; Schölkopf et al., 2021; Seitzer et al., 2021; Zhang et al., 2020). Starting from low-level information like pixels constitutes a considerable challenge, since we aim at reasoning about objects and abstract concepts instead of individual pixels. We believe that this work contributes an important step towards tackling this challenge, since it goes beyond previous work by considering instantaneous effects, a common property of real-world systems (Faes et al., 2010; Hyvärinen et al., 2008; Moneta et al., 2006; Nuzzi et al., 2021). Besides providing theoretical identifiability results, we also propose a practical algorithm with which one can learn the causal variables and their graph from high-dimensional observations. Furthermore, we envision a reinforcement learning setting as a future application, where a robotic system may be able to interact with an environment. The main assumption that prevents this so far is the availability of interventions with known targets. In many systems, one might not be able to directly perform such interventions, but rather requires several steps of low-level actions. For instance, instead of being provided the intervention targets, future work could consider a robotic setup where one controls a robot arm that can perform several interactions (e.g., flipping a switch); we believe that our work can constitute the starting point for such an extension. Moreover, as we have seen in the experiments on the Causal3DIdent dataset and the Causal Pinball environment, not all assumptions must be strictly fulfilled to identify the variables empirically.
Moving towards this empirical goal, recent advances in unsupervised object-centric learning (Engelcke et al., 2020; Kipf et al., 2022; Locatello et al., 2020c) have shown that objects, which can often be considered as groups of causal variables like position and velocity, can be identified from high-dimensional data without labels. A possible combination of such object-centric approaches with our causal representation learning method can relax further assumptions by using the objects as a prior disentanglement of information, opening up further possible applications of iCITRIS. Thus, we believe that this work can form the basis of several future works in this direction. Since the possible applications of causal representation learning and specifically iCITRIS are fairly wide-ranging, there might be potential impacts we cannot forecast at the current time. This includes misuses of the method for unethical purposes. For instance, an incorrect application of the method can be used to justify false causal relations, such as referencing gender and race as causes for other characteristics of a person. Hence, the obligation to use this method in a correct way within ethical boundaries lies on the user, and the outputs of the method should always be critically evaluated. We will emphasize this responsibility of the user in the public license of our code.

B REPRODUCIBILITY STATEMENT

For reproducibility, the code for all models used in this paper is publicly available at https://github.com/phlippe/CITRIS. Further, we provide the code for generating the Voronoi benchmark, the Instantaneous Temporal Causal3DIdent dataset, and the Causal Pinball environment. More details on the datasets and visualizations are outlined in Appendix E. Moreover, for all experiments of Section 5, we have included a detailed overview of the hyperparameters in Appendix F.2 and additional implementation details of the evaluation metrics and model architecture components in Appendix F.1. All experiments have been repeated for at least 3 seeds (5 seeds for the Voronoi benchmark) to obtain stable, reproducible results. We provide an overview of the standard deviations, as well as additional results and ablation studies, in Appendix G. Finally, all experiments in this paper were performed on a single NVIDIA TitanRTX GPU with a 6-core CPU. The overall computation time of all experiments in this paper corresponds to approximately 80 GPU days (excluding hyperparameter search and trials during the research).

C EXPANDED RELATED WORK

Early works on causal representation learning focused on identifying independent factors of variation (Klindt et al., 2021; Kumar et al., 2018; Locatello et al., 2019; 2020b; Träuble et al., 2021). A related line of work, Independent Component Analysis (ICA) (Comon, 1994; Hyvärinen et al., 2001), tries to recover independent latent variables that were transformed by some invertible transformation. ICA was extended to non-linear transformations by exploiting auxiliary variables that make the latents mutually conditionally independent (Hyvärinen et al., 2016; Hyvärinen et al., 2019), combined with deep learning methods like VAEs (Khemakhem et al., 2020a; b; Reizinger et al., 2022; Sorrenson et al., 2020; Zimmermann et al., 2021), and applied to causality (Gresele et al., 2021; Monti et al., 2019; Shimizu et al., 2006). Some recent approaches instead rely on counterfactual observations; however, knowing counterfactuals is not realistic in most scenarios. Instead, CITRIS (Lippe et al., 2022b) focuses on temporal sequences, in which the variables that are not intervened upon at a given time step can still continue evolving over time. On the other hand, in this setting the intervention targets need to be known. Moreover, within a time step, the causal variables are assumed to be independent conditioned on the variables of the previous time step, hence not allowing for instantaneous effects. To the best of our knowledge, iCITRIS is the first method to identify causal variables and their causal graph from temporal, intervened sequences even with potentially instantaneous causal effects, without requiring counterfactuals or data labeled with the true causal variables.

D PROOFS

In this section, we provide the proofs for the identifiability result, Theorem 3.4 of Section 3, and for Lemma 3.2. The section is structured into three main parts. First, in Appendix D.1, we give an overview of the notation and elements that are used in the proof. Next, we discuss the assumptions needed for Theorem 3.4, with a focus on why they are needed and what a violation of these assumptions can cause; additionally, we provide a proof of Lemma 3.2 in this subsection. Finally, we give the proof of Theorem 3.4, structured into multiple subsections corresponding to the main steps of the proof. A detailed overview of the proof is provided in Appendix D.3.

D.1 PRELIMINARIES

Throughout the proof, we will use the same notation as in the main paper, aligned as much as possible with Lippe et al. (2022b). As a summary, we review here an adapted version of the notation and preliminaries of the proof for CITRIS:

• We denote the $K$ causal factors in the latent causal dynamical system as $C_1, \ldots, C_K$;
• The dimensions and space of a causal variable are denoted as $C_i \in \mathcal{D}_i^{M_i}$ with $M_i \geq 1$. In the remainder of the proof, we consider $\mathcal{D}_i$ to be $\mathbb{R}$, i.e., $C_i$ being a continuous variable;
• We group all causal factors in a single variable $C = (C_1, \ldots, C_K) \in \mathcal{C}$, where $\mathcal{C}$ is the causal factor space $\mathcal{C} = \mathcal{D}_1^{M_1} \times \mathcal{D}_2^{M_2} \times \ldots \times \mathcal{D}_K^{M_K}$;
• The data we base our identifiability on is generated by a latent dynamic Bayesian network with variables $(C_1^t, C_2^t, \ldots, C_K^t)_{t=1}^T$;
• We assume to know at each time step the binary intervention variables $I^t \in \{0,1\}^{K+1}$, where $I_i^t = 1$ refers to an intervention on the causal factor $C_i^t$. As a special case, $I_0^t = 0$ for all $t$;
• For each causal factor $C_i$, there exists a minimal causal split $\left(s_i^{\mathrm{var}}(C_i), s_i^{\mathrm{inv}}(C_i)\right)$ such that $s_i^{\mathrm{var}}(C_i)$ represents only the variable/manipulable part of $C_i$, while $s_i^{\mathrm{inv}}(C_i)$ represents the invariable part of $C_i$;
• At each time step, we have access to observations $x^t, x^{t+1} \in \mathcal{X} \subseteq \mathbb{R}^N$;
• There exists a bijective mapping between observations and the causal/noise space, denoted by $h : \mathcal{C} \times \mathcal{E} \to \mathcal{X}$, where $\mathcal{E}$ is the space of the noise variable. The bijective map implies that the observations, $\mathcal{X}$, live on a lower-dimensional manifold of dimension $\dim(\mathcal{C} \times \mathcal{E})$ in $\mathbb{R}^N$. For example, in Causal Pinball, only a limited set of images can occur. Formally, this means that there exists an inverse of the observation function, $h^{-1}$, such that $h(h^{-1}(X)) = X$ for all $X \in \mathcal{X}$, and $h^{-1}(h([C; E])) = [C; E]$ for all $C \in \mathcal{C}, E \in \mathcal{E}$;
• The noise $E^t \in \mathcal{E}$ at a time step $t$ subsumes all randomness besides the causal model which influences the observations.
For example, this could be brightness shifts in Causal3DIdent, or color shifts in the Causal Pinball environment, since in these setups no causal factor is encoded in brightness or color, respectively. While this setting is quite general, we still require that the values of the causal factors are identifiable from single observations. Hence, the joint dimensionality of the observation noise and the causal model is limited by the image size.

• For any model learning a latent space, we denote the vector of latent variables by $z^t \in \mathcal{Z} \subseteq \mathbb{R}^M$, where $\mathcal{Z}$ is the latent space of dimension $M = \dim(\mathcal{E}) + \dim(\mathcal{C})$. In practice, we usually overestimate $M$, i.e., $M > \dim(\mathcal{E}) + \dim(\mathcal{C})$;
• In iCITRIS, we learn the inverse of the observation function as $g_\theta : \mathcal{X} \to \mathcal{Z}$. If $\dim(\mathcal{X}) > \dim(\mathcal{C} \times \mathcal{E})$, we consider $g_\theta$ to contain an arbitrary, fixed map from $\mathcal{X}$ to the lower dimensionality. This map can, in theory, be found trivially by ensuring invertibility for all $X \in \mathcal{X}$ while minimizing the number of dimensions. An alternative interpretation is that $g_\theta$ is a deterministic variational autoencoder with zero reconstruction loss. In this limit, Nielsen et al. (2020) showed that the encoder-decoder pair functions as an invertible normalizing flow, which is what we base our analysis on as well.
In practice, we train a deterministic autoencoder with a latent dimension greater than $\dim(\mathcal{E}) + \dim(\mathcal{C})$, and work in this larger dimensionality;
• In iCITRIS, we learn an assignment from latent dimensions to causal factors, denoted by $\psi : \{1..M\} \to \{0..K\}$;
• The latent variables assigned to a causal factor $C_i$ by $\psi$ are denoted as $z_{\psi_i} = \{z_j \mid j \in \{1..M\}, \psi(j) = i\} = \{g_\theta(x^t)_j \mid j \in \{1..M\}, \psi(j) = i\}$;
• The remaining latent variables that are not assigned to any causal factor are denoted as $z_{\psi_0}$;
• In iCITRIS, we learn a directed, acyclic graph $G = (V, E)$ where $V = \{z_{\psi_i} \mid i \in \{0..K\}\}$ and the edges represent directed causal relations;
• The graph $G$ induces a parent structure which we denote by $z_{\psi_{\mathrm{pa}_i}} = \{z_j \mid j \in \{1..M\}, \psi(j) \in \mathrm{pa}_G(i)\}$, where $\mathrm{pa}_G(0) = \emptyset$, i.e., the variables in $z_{\psi_0}$ have no instantaneous parents;
• The parents of a causal variable within the same time step $t+1$ are denoted by $\mathrm{pa}^{t+1}(C_i^{t+1})$, and the parents from the previous time step $t$ by $\mathrm{pa}^t(C_i^{t+1})$;
• As a special case, we denote the function $g_\theta$ with parameters $\theta$ that precisely model the inverse of the true observation function, $h^{-1}$, as the disentanglement function $\delta^* : \mathcal{X} \to \tilde{\mathcal{C}} \times \tilde{\mathcal{E}}$ with $\tilde{\mathcal{C}} = \mathcal{D}^{\tilde{M}_1} \times \ldots \times \mathcal{D}^{\tilde{M}_K}$ and $\tilde{M}_i$ being the number of latent dimensions assigned to the causal factor $C_i$ by $\psi^*$. We denote the output of $\delta^*$ for an observation $X$ as $\delta^*(X) = (\tilde{C}_1, \tilde{C}_2, \ldots, \tilde{E})$. The representation of $\delta^*$ as a learnable function is denoted by $g_\theta^*$ and $\psi^*$;
• In the following proof, we will use entropy as a measure of the information content of a random variable. To be invariant to invertible transformations, e.g., scaling by 2, we use the notion of the limiting density of discrete points (LDDP) (Jaynes, 1957; 1968). In contrast to differential entropy, LDDP introduces an invariant measure $m(X)$, which can be seen as a reference distribution against which we measure the entropy of $p(X)$.
The entropy is thereby defined as $$H(X) = -\int p(X) \log \frac{p(X)}{m(X)} \, \mathrm{d}X.$$ In the following proof, we will consider entropy measures over latent and causal variables. For the latent variables, we consider $m(X)$ to be the push-forward distribution of an arbitrary but fixed distribution on $\mathcal{X}$ (e.g., a random Gaussian if $\mathcal{X} = \mathbb{R}^n$) through $g_\theta$. For the causal variables, we consider it to be the push-forward through $h^{-1}$. For more details on LDDP, see Lippe et al. (2022b, Appendix A.1.2) and Jaynes (1957; 1968).
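The LDDP entropy above can be estimated by Monte Carlo as $-\mathbb{E}_p[\log p(X) - \log m(X)]$. As a minimal sketch (our own illustration, not from the paper), take $p = \mathcal{N}(0, \sigma^2)$ and a standard-normal reference measure $m$, for which the closed form is $-\tfrac{1}{2}(\sigma^2 - 1 - \ln \sigma^2)$:

```python
# Monte Carlo estimate of the LDDP entropy H(X) = -E_p[log p(X)/m(X)]
# for a Gaussian p and a standard-normal reference measure m.
import numpy as np

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def lddp_entropy_mc(mu, sigma, ref_mu=0.0, ref_sigma=1.0, n=200_000, seed=0):
    """Estimate H(X) = -E_p[log p(X) - log m(X)] by sampling from p."""
    x = np.random.default_rng(seed).normal(mu, sigma, size=n)
    return -np.mean(gauss_logpdf(x, mu, sigma) - gauss_logpdf(x, ref_mu, ref_sigma))

sigma = 0.5
est = lddp_entropy_mc(0.0, sigma)
closed = -0.5 * (sigma**2 - 1.0 - np.log(sigma**2))
```

Note that, unlike differential entropy, this quantity is unchanged if the same invertible transformation is applied to both $p$ and the reference $m$, which is exactly the invariance the proof relies on.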

D.2 ASSUMPTIONS FOR IDENTIFIABILITY

In this section, we provide a detailed discussion of the assumptions of iCITRIS that enable the identification of an underlying causal graph with instantaneous effects. We thereby focus on why these assumptions are necessary, and how a violation of them can lead to scenarios where the causal variables and the graph are not identifiable.

D.2.1 ASSUMPTION 1: THE INTERVENTIONS ON THE CAUSAL VARIABLES REMOVE INSTANTANEOUS PARENTS

iCITRIS requires interventions on the causal variables that remove instantaneous parents in order to separate the variables in latent space, as stated in Lemma 3.2 and repeated here for completeness:

Lemma D.1. In iCITRIS, a causal variable $C_i$ cannot be identified up to an invertible transformation $T_i$ that is independent of all other causal variables if $C_i$ can have instantaneous parents and no partially-perfect interventions on $C_i$ are provided.

Proof. To prove this lemma, consider a causal variable $C_i$ that has $C_j$ as an instantaneous parent. The conditional distribution of $C_i$, as defined in iCITRIS, can be written as $p_i(C_i^{t+1} \mid C_j^{t+1}, S, I_i^{t+1})$, where $S \subseteq C^t \cup C^{t+1} \setminus \{C_i^{t+1}, C_j^{t+1}\}$, and the observations are generated as $h(C^t, E^t) = X^t$. Under this setting, it is sufficient to show that there exists another representation $\hat{C}$ that cannot be distinguished from $C$ solely based on observation triples $\{X^t, X^{t+1}, I^{t+1}\}$, and that there exists no invertible function $f$ such that $f(\hat{C}_i^t) = C_i^t$ for any $t$. Note that we exclude a permutation of variables, since the intervention targets $I^{t+1}$ align the two representations. As an alternative representation, consider $\hat{C} = \{C_1, \ldots, C_{i-1}, C_i + C_j, C_{i+1}, \ldots, C_K\}$ with $K$ being the number of causal variables. Then, the distribution $p(\hat{C})$ only differs in the conditional of $p_i$ as follows:

$\hat{p}_i(\hat{C}_i^{t+1} \mid \hat{C}_j^{t+1}, \hat{S}, I_i^{t+1}) = \hat{p}_i(\hat{C}_i^{t+1} \mid C_j^{t+1}, S, I_i^{t+1})$ (6)
$= \hat{p}_i(C_i^{t+1} + C_j^{t+1} \mid C_j^{t+1}, S, I_i^{t+1})$ (7)

Because $\hat{p}_i$ is conditioned on $C_j^{t+1}$, there exists an invertible, volume-preserving transformation $w$ from $C_i^{t+1} + C_j^{t+1}$ to $C_i^{t+1}$, i.e., $w(c \mid C_j^{t+1}) = c - C_j^{t+1}$. Hence, it follows that $\hat{p}_i(C_i^{t+1} + C_j^{t+1} \mid C_j^{t+1}, S, I_i^{t+1}) = p_i(C_i^{t+1} \mid C_j^{t+1}, S, I_i^{t+1})$, and overall that $p(\hat{C}) = p(C)$.
Furthermore, there exists a function $\hat{h}$ that maps $\hat{C}$ to the same observations as $h$ does for $C$:

$\hat{h}(\hat{C}^t, E^t) = h(\{\hat{C}_1^t, \ldots, \hat{C}_{i-1}^t, \hat{C}_i^t - \hat{C}_j^t, \hat{C}_{i+1}^t, \ldots, \hat{C}_K^t\}, E^t) = h(C^t, E^t)$ (9)

Therefore, both representations, $C$ and $\hat{C}$, can model the same data generation process for $\{X^t, X^{t+1}, I^{t+1}\}$, and are indistinguishable from these observations alone. Finally, it is apparent that there exists no invertible transformation from $\hat{C}_i^t$ to $C_i^t$ that is independent of $C_j^t$. Thus, the causal variable $C_i$ is not identifiable up to invertible, component-wise transformations.

As an example of how this affects a standard identification problem, consider two causal variables $C_1, C_2$ with the causal graph $C_1^t \to C_1^{t+1}$, $C_2^t \to C_2^{t+1}$. The two causal variables $C_1, C_2$ therefore have no instantaneous relations. Further, consider the (soft-interventional) distributions $p_1(C_1^{t+1} \mid C_1^t, I_1^{t+1})$ and $p_2(C_2^{t+1} \mid C_2^t, I_2^{t+1})$, whose form can be arbitrary; for this example, we choose them to be Gaussian with constant variance:

$p_1(C_1^{t+1} \mid C_1^t, I_1^{t+1}) = \begin{cases} \mathcal{N}(C_1^{t+1} \mid \mu_1(C_1^t), \sigma_1(C_1^t)^2) & \text{if } I_1^{t+1} = 0 \\ \mathcal{N}(C_1^{t+1} \mid \tilde{\mu}_1(C_1^t), \tilde{\sigma}_1(C_1^t)^2) & \text{if } I_1^{t+1} = 1 \end{cases}$ (10)

$p_2(C_2^{t+1} \mid C_2^t, I_2^{t+1}) = \begin{cases} \mathcal{N}(C_2^{t+1} \mid \mu_2(C_2^t), \sigma_2(C_2^t)^2) & \text{if } I_2^{t+1} = 0 \\ \mathcal{N}(C_2^{t+1} \mid \tilde{\mu}_2(C_2^t), \tilde{\sigma}_2(C_2^t)^2) & \text{if } I_2^{t+1} = 1 \end{cases}$ (11)

where $\mu_1, \tilde{\mu}_1, \sigma_1, \tilde{\sigma}_1$ and $\mu_2, \tilde{\mu}_2, \sigma_2, \tilde{\sigma}_2$ are arbitrary, potentially non-linear functions of $C_1^t$ and $C_2^t$, respectively. Further, to consider the simplest case, suppose that the observations $X^t$ at a time step $t$ are the causal variables themselves, $X^t = [C_1^t, C_2^t]$, and that we observe data points for all intervention settings, i.e., $I_i^{t+1} \sim \mathrm{Bernoulli}(q)$ with $0 < q < 1$.
Under this setup, the true generative model follows the distribution:

$p(X^{t+1} \mid X^t, I^{t+1}) = p(C_1^{t+1}, C_2^{t+1} \mid C_1^t, C_2^t, I_1^{t+1}, I_2^{t+1})$ (12)
$= p(C_1^{t+1} \mid C_1^t, C_2^t, I_1^{t+1}, I_2^{t+1}) \cdot p(C_2^{t+1} \mid C_1^t, C_2^t, I_1^{t+1}, I_2^{t+1})$ (13)
$= p_1(C_1^{t+1} \mid C_1^t, I_1^{t+1}) \cdot p_2(C_2^{t+1} \mid C_2^t, I_2^{t+1})$ (14)

where $C_1^{t+1} \perp\!\!\!\perp C_2^{t+1} \mid X^t, I^{t+1}$. To show that the causal variables are not uniquely identifiable, we need at least one other representation which can achieve the same likelihood as the true generative model under all intervention settings $I^{t+1}$. For this, consider the following distribution:

$p(X^{t+1} \mid X^t, I^{t+1}) = p(C_1^{t+1}, C_2^{t+1} \mid C_1^t, C_2^t, I_1^{t+1}, I_2^{t+1})$ (15)
$= p(C_1^{t+1} \mid C_1^t, C_2^t, I_1^{t+1}, I_2^{t+1}) \cdot p(C_2^{t+1} \mid C_1^t, C_2^t, C_1^{t+1}, I_1^{t+1}, I_2^{t+1})$ (16)
$= p_1(C_1^{t+1} \mid C_1^t, I_1^{t+1}) \cdot \tilde{p}_2(C_1^{t+1} + C_2^{t+1} \mid C_2^t, C_1^{t+1}, I_2^{t+1})$ (17)
$= p_1(\hat{C}_1^{t+1} \mid C_1^t, I_1^{t+1}) \cdot \tilde{p}_2(\hat{C}_2^{t+1} \mid C_2^t, \hat{C}_1^{t+1}, I_2^{t+1})$ (18)

with $\hat{C}_1^{t+1} = C_1^{t+1}$, $\hat{C}_2^{t+1} = C_1^{t+1} + C_2^{t+1}$. Note the additional dependency of $\hat{C}_2^{t+1}$ on $\hat{C}_1^{t+1}$, which is possible in the space of possible causal models with an additional instantaneous causal edge $\hat{C}_1^{t+1} \to \hat{C}_2^{t+1}$.

[Figure 4: An example with a non-trivial observation function. The central plot of each subfigure shows a 2D histogram, and the subplots above and on the right show the 1D marginal histograms. For simplicity, we keep the previous time step, $X^{t-1}$, constant. (a) Observational distribution $p(X^t \mid X^{t-1} = C, I_1^t = 0, I_2^t = 0)$; (b) interventional distribution $p(X^t \mid X^{t-1} = C, I_1^t = 1, I_2^t = 0)$; (c) interventional distribution $p(X^t \mid X^{t-1} = C, I_1^t = 0, I_2^t = 1)$. From the interventional distributions, one might suggest the latent causal graph $C_1 \to C_2$, since under $I_1^t = 1$ both observed distributions change, while $I_2^t = 1$ leaves $X_2$ unchanged. However, the data has actually been generated from two independent causal variables, which have been entangled by the observation function $X^t = [C_1^t, C_1^t + C_2^t]$. We cannot distinguish between these two latent models from interventions that do not reliably break instantaneous causal effects, showing the need for partially-perfect interventions.]

The new distribution $\tilde{p}_2$ is identical to the true distribution, since:

$\tilde{p}_2(C_1^{t+1} + C_2^{t+1} \mid C_2^t, C_1^{t+1}, I_2^{t+1} = 0) = \mathcal{N}(C_1^{t+1} + C_2^{t+1} \mid C_1^{t+1} + \mu_2(C_2^t), \sigma_2(C_2^t)^2)$ (19)
$= \frac{1}{\sqrt{2\pi}\,\sigma_2(C_2^t)} \exp\left(-\frac{1}{2} \frac{\left(C_1^{t+1} + C_2^{t+1} - (C_1^{t+1} + \mu_2(C_2^t))\right)^2}{\sigma_2(C_2^t)^2}\right)$ (20)
$= \frac{1}{\sqrt{2\pi}\,\sigma_2(C_2^t)} \exp\left(-\frac{1}{2} \frac{\left(C_2^{t+1} - \mu_2(C_2^t)\right)^2}{\sigma_2(C_2^t)^2}\right)$ (21)
$= \mathcal{N}(C_2^{t+1} \mid \mu_2(C_2^t), \sigma_2(C_2^t)^2)$ (22)
$= p_2(C_2^{t+1} \mid C_2^t, I_2^{t+1} = 0)$ (23)

Similarly, one can show that $\tilde{p}_2(C_1^{t+1} + C_2^{t+1} \mid C_2^t, C_1^{t+1}, I_2^{t+1} = 1) = p_2(C_2^{t+1} \mid C_2^t, I_2^{t+1} = 1)$. Hence, the alternative representation $\hat{C}_1^{t+1}, \hat{C}_2^{t+1}$ can model the distribution $p(X^{t+1} \mid X^t, I^{t+1})$ as well as the true causal model. In conclusion, from the samples alone, we cannot distinguish between the two representations $C_1, C_2$ and $\hat{C}_1, \hat{C}_2$, and the model is therefore not identifiable up to invertible transformations. An alternative example with a non-trivial observation function is visualized in Figure 4, which further underlines the problem. This shows that with soft interventions, one cannot distinguish between causal relations introduced by the observation function and those in the true causal model.
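The equality of the two likelihoods can be checked numerically. The sketch below is our own illustration: it uses constant means and variances instead of the functions $\mu_2(\cdot), \sigma_2(\cdot)$, holding the previous time step fixed. Because the map $(C_1, C_2) \mapsto (C_1, C_1 + C_2)$ is volume-preserving, the joint log-densities agree exactly:

```python
# Numerical check: the entangled representation (C1, C1 + C2) with the
# extra edge C1 -> C2-hat yields exactly the same joint density as the
# true model with independent C1, C2 (volume-preserving change of variables).
import numpy as np

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

mu1, s1, mu2, s2 = 0.3, 1.2, -0.7, 0.8   # arbitrary constant parameters
rng = np.random.default_rng(0)
c1 = rng.normal(mu1, s1, size=1000)
c2 = rng.normal(mu2, s2, size=1000)

# True model: C1 and C2 independent given the previous time step.
log_p_true = gauss_logpdf(c1, mu1, s1) + gauss_logpdf(c2, mu2, s2)

# Alternative model: C1-hat = C1, C2-hat = C1 + C2, with edge C1-hat -> C2-hat.
c2_hat = c1 + c2
log_p_alt = gauss_logpdf(c1, mu1, s1) + gauss_logpdf(c2_hat, c1 + mu2, s2)

assert np.allclose(log_p_true, log_p_alt)
```

The assertion passes for every sample, mirroring the derivation in equations (19)-(23).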
(Partially-)perfect interventions, however, provide an opportunity to do so: if we had known that the intervention on $C_2$ renders it independent of $C_1$, the second causal model could not have modeled the correct distribution under $I_2 = 1$. Thus, we can distinguish between the two, allowing us to identify the correct causal model. Note that under partially-perfect interventions, the intervention-independent part of a causal variable, $s^{\mathrm{inv}}(C_i^t)$, automatically cannot have any instantaneous parents, since otherwise the intervention would not remove all instantaneous parents and hence would not be partially-perfect.

D.2.2 ASSUMPTION 2: THE INTERVENTION VARIABLES ARE NOT A DETERMINISTIC FUNCTION OF EACH OTHER

iCITRIS builds upon interventions to identify the causal variables. The intervention targets are not necessarily independent of each other, but can be confounded. For instance, we could have a setting where we only obtain single-target interventions, or where a certain variable $C_i$ can only be intervened upon jointly with another variable $C_j$. In this large space of possible experimental settings, we naturally cannot guarantee identifiability in all cases. In particular, we require that the intervention targets of different causal variables are unique:

Lemma D.2. All information that is strictly dependent on the intervention target $I_i^t$, i.e., $s^{\mathrm{var}}(C_i)$, the minimal causal variable of $C_i$, cannot be disentangled from another causal variable $C_j$ with $j \neq i$ if their intervention targets are identical: $\forall t, I_i^t = I_j^t$.

Proof. Lippe et al. (2022b) have shown that two causal variables $C_i, C_j$ cannot be disentangled from observational data alone if they follow a Gaussian distribution with equal variance over time. Taking this setup, consider that, in addition to observational data, we observe samples where both variables have been intervened upon, $I_i^{t+1} = I_j^{t+1} = 1$. If the interventional distributions of $C_i$ and $C_j$ are both Gaussian with the same variance, we have the same non-identifiability as in the observational case. Since the entanglement axes can transfer between the two setups, $C_i$ and $C_j$ cannot be disentangled, and therefore neither can their minimal causal variables. In other words, if two variables are always jointly intervened upon or jointly passively observed, we cannot distinguish whether information belongs to the causal variable $C_i$ or to $C_j$. Since the causal system is stationary, having a time step $t$ for which $I_i^t \neq I_j^t$ can occur implies that, in the sample limit, we will observe samples with $I_i^t \neq I_j^t$ as well.
Further, when we only observe joint interventions on two variables $C_i, C_j$, the causal graph between the two variables cannot be identified for arbitrary distributions (Eberhardt, 2007), making the identifiability of the graph and variables impossible. Following Lippe et al. (2022b), we require that the following dependence holds for every causal variable $C_i$ with observed interventions:

$C_i^{t+1} \not\perp\!\!\!\perp I_i^{t+1} \mid C^t, \mathrm{pa}^{t+1}(C_i^{t+1}), I_j^{t+1} \quad \text{for any } j \neq i$ (24)

This also implies that there does not exist a variable $C_j$ for which $\forall t, I_i^t = 1 - I_j^t$. As mentioned before, under additional assumptions, such as every causal variable having at least one parent, this can be relaxed to unique interventions.
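The degenerate cases named above (always-identical or always-complementary targets) can be screened for on a logged dataset of intervention targets. This is a hypothetical helper of our own; the function name and data are illustrative:

```python
# Sanity check for Assumption 2 on a matrix of binary intervention
# targets (rows = time steps, columns = causal variables).
import numpy as np

def targets_are_unique(I):
    """Return False if two variables' targets are identical or exactly
    complementary across all time steps; both make them inseparable."""
    _, K = I.shape
    for i in range(K):
        for j in range(i + 1, K):
            if np.all(I[:, i] == I[:, j]) or np.all(I[:, i] == 1 - I[:, j]):
                return False
    return True

I_ok = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
I_bad = np.array([[1, 1, 0], [0, 0, 1], [1, 1, 0]])  # columns 0 and 1 identical
```

A finite log can of course only falsify the assumption; in the sample limit, the check corresponds to the dependence requirement of equation (24) for confounded but non-deterministic targets.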

D.2.3 ASSUMPTION 3: DISTRIBUTIONS HAVE FULL SUPPORT

Following several previous works (Brehmer et al., 2022; von Kügelgen et al., 2021), we assume for the theoretical results that all distributions have full support. If the observational and interventional distributions do not share the same support, there exist data points for which the intervention targets can be determined from the observation $X^t$ alone. In such a situation, the encoder can change its encoding depending on the intervention target, as long as the decoder can still recover the full observation. This can lead to representation models that ignore the latent structure, since the intervention targets are already known. Furthermore, when intervention targets can be inferred from the causal variables, we potentially introduce new independencies from intervention targets. For instance, if we have the graph $C_1, C_2 \to C_3$ where $I_3 = 1$ only if $I_1 = 1, I_2 = 0$, we can deduce the intervention targets from other causal factors, making $C_3$ essentially independent of $I_3$. To prevent such degenerate solutions, we make the assumption that the observational and interventional distributions share the same support. This assumption implies that any data point could come from either the interventional or the observational regime, ensuring that the intervention target cannot be deterministically inferred from an observation $X^t$.

D.2.4 ASSUMPTION 4: TEMPORAL CONNECTIONS AND INTERVENTIONS BREAK ALL SYMMETRIES IN THE DISTRIBUTIONS

The temporal and interventional dependencies are an essential part of iCITRIS's guarantee of identifiability and disentanglement of the causal variables. Without any of these dependencies, there may exist multiple representations that model the same distribution $p(X^t \mid X^{t-1}, I^t)$ while following the latent structure enforced by iCITRIS. The problem is that variables can be functionally dependent on each other, where these dependencies exploit symmetries that leave the distribution unchanged.

For instance, consider the instantaneous causal graph of three variables $C_1, C_2, C_3$ with $C_1, C_3 \to C_2$, as depicted in Figure 5. Suppose that $C_1$ does not have any temporal parents, and its observational distribution follows a Gaussian: $p(C_1^t \mid I_1^t = 0) = \mathcal{N}(C_1^t \mid \mu_1, \sigma_1^2)$ with $\mu_1, \sigma_1^2$ being constants. Further, suppose that under interventions only the standard deviation changes, i.e., $p(C_1^t \mid I_1^t = 1) = \mathcal{N}(C_1^t \mid \mu_1, \tilde{\sigma}_1^2)$ with $\tilde{\sigma}_1^2 \neq \sigma_1^2$. Then, for any point $C_1^t = c_1$, there exists a second point, $c_1' = 2\mu_1 - c_1$, which has the same probability for any value of $I_1^t$. This is because both distributions, $p(C_1^t \mid I_1^t = 0)$ and $p(C_1^t \mid I_1^t = 1)$, share a symmetry around the mean $\mu_1$. Now, suppose we have the optimal encoder, which maps an observation $X^t$ of this system to the three causal variables with their ground-truth values. Then there exists an alternative encoder, which flips the observed value of $C_1^t$ around the mean $\mu_1$, deterministically conditioned on the remaining variables $C_2^t$ and $C_3^t$.
For instance, we could have the following representation $\hat{C}_1^t, \hat{C}_2^t, \hat{C}_3^t$ for the causal variables:

$\hat{C}_2^t = C_2^t, \quad \hat{C}_3^t = C_3^t, \quad \hat{C}_1^t = \begin{cases} C_1^t & \text{if } \hat{C}_3^t > 0 \\ 2\mu_1 - C_1^t & \text{otherwise} \end{cases}$

This alternative representation shares the same likelihood as the optimal encoder in terms of $p(X^t \mid X^{t-1}, I^t)$, since flipping the value of $C_1^t$ around the mean does not change its probability. Further, despite the flipping, the original observation $X^t$ can be recovered from the alternative representation $\hat{C}^t$ by the decoder, because the possible conditioning factors, i.e., $\hat{C}_3^t$ in this case, are observable to the decoder. Hence, both representations are equally valid causal models. Yet, one cannot recover the value of the true causal variable $C_1^t$ from its alternative representation $\hat{C}_1^t$ alone, since $\hat{C}_3^t$ needs to be known to invert the example condition. This shows that we can have functional dependencies between representations of causal variables while their distributions remain independent. Thus, there exists more than one representation that cannot be distinguished from samples of $p(X^t \mid X^{t-1}, I^t)$ alone. More generally, functional dependencies between variables can be introduced if there exists a transformation that leaves the probability of a variable $C_i$ unchanged for any possible value of its parents unseen in $X^t$, i.e., its intervention target $I_i^t$ and temporal parents $C^{t-1}$. Whether this transformation is performed or not can then be conditioned on other variables at time step $t$. Meanwhile, this transformation does not introduce additional dependencies in the causal graph, since the distribution does not change. To prevent such transformations from being possible, the temporal parents and intervention targets need to break all symmetries in the distributions.
We formalize this in the following assumption:

Assumption 4: For a causal variable C_i and its causal mechanism p(C^{t+1}_i | pa^{t+1}(C^{t+1}_i), pa^t(C^{t+1}_i), I^{t+1}_i), there exists no invertible, smooth transformation T with T(C^{t+1}_i | C^{t+1}_{-i}) = \tilde{C}^{t+1}_i besides the identity for which the following holds:

∀ C^t, C^{t+1}, I^{t+1}: p(C^{t+1}_i | pa^{t+1}(C^{t+1}_i), pa^t(C^{t+1}_i), I^{t+1}_i) = \left| \frac{∂T(C^{t+1}_i | C^{t+1}_{-i})}{∂C^{t+1}_i} \right| · p(\tilde{C}^{t+1}_i | pa^{t+1}(C^{t+1}_i), pa^t(C^{t+1}_i), I^{t+1}_i)

Intuitively, this means that there does not exist any symmetry that is shared across all possible values of the parents (temporal and interventional) of a causal variable. While this might at first sound restrictive, the assumption will likely hold in most practical scenarios. For instance, if the distribution is a Gaussian, the assumption holds as long as the mean is not constant, since any parent dependencies are broken by the perfect interventions. The same holds in higher dimensions, where the additional symmetries, i.e., rotations, are likewise broken if the center point is not constant. Note that these symmetries must be smooth transformations, in contrast to the discontinuous flipping operation on the Gaussian (i.e., either we flip the distribution or not, but there is no step in between).
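Conversely, a quick numerical sketch of why a non-constant mean satisfies Assumption 4 for Gaussians. The concrete values (an intervention shifting the mean from 1.5 to 3.0) are illustrative assumptions:

```python
import math

def gauss_logpdf(x, mu, var):
    # log-density of N(x | mu, var)
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

c = 2.7
mu_obs, mu_int = 1.5, 3.0      # the intervention shifts the mean
flip = 2 * mu_obs - c          # symmetry point of the observational regime only

# The flip is still a symmetry of the observational regime ...
assert abs(gauss_logpdf(c, mu_obs, 1.0) - gauss_logpdf(flip, mu_obs, 1.0)) < 1e-12
# ... but no longer of the interventional one, so no single transformation T is
# shared across all regimes: the symmetry is broken, as Assumption 4 requires.
assert abs(gauss_logpdf(c, mu_int, 1.0) - gauss_logpdf(flip, mu_int, 1.0)) > 1e-3
```

Any candidate flip point works for at most one of the two means, so no non-trivial T survives both regimes.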

D.2.5 ASSUMPTION 5: CAUSAL GRAPH STRUCTURE REQUIREMENTS

Besides disentangling and identifying the true causal variables, we are also interested in finding the instantaneous causal graph. This requires us to perform causal discovery, for which we need to make additional assumptions. First, we assume that the causal graph is acyclic, i.e., for any causal variable C^t_i, there does not exist a path through the directed causal graph that loops back to it. Note that this excludes different instances over time, meaning that a path from C^t_i to C^{t+τ}_i is not considered a loop. In real-world setups, there potentially exist instantaneous graphs which are not acyclic, essentially modeling a feedback loop over multiple variables. However, to rely on the graph as a factorization of the distribution, we assume it to be acyclic, and leave extensions to cyclic causal graphs for future work. As the second causal graph assumption, we require that the causal graph is faithful, which means that all independencies between causal variables are implied by the graph structure, not by the specific parameterization of the distributions (Hyttinen et al., 2013; Pearl, 2009). Without faithfulness, the graph might not be fully recoverable. Finally, we assume causal sufficiency, i.e., there do not exist any additional latent confounders that introduce dependencies between variables beyond the ones we model. Note that this assumption exempts the potential latent confounder between the intervention targets; it instead concerns the causal variables C_1, ..., C_K, which may only depend on their intervention targets, the previous time step C^t, and their instantaneous parents in C^{t+1}.
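The acyclicity requirement can be sketched with a standard depth-first cycle check over an explicit edge list; the edge lists below are illustrative, not taken from the paper's datasets:

```python
def is_acyclic(edges, n):
    """DFS-based check that a directed graph over nodes 0..n-1 has no cycle."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
    state = [0] * n  # 0 = unvisited, 1 = on current DFS path, 2 = finished

    def dfs(u):
        state[u] = 1
        for v in adj[u]:
            if state[v] == 1:                  # back edge: cycle found
                return False
            if state[v] == 0 and not dfs(v):
                return False
        state[u] = 2
        return True

    return all(dfs(u) for u in range(n) if state[u] == 0)

# C_1 -> C_2 <- C_3 (as in Figure 5, 0-indexed) is acyclic; a 3-cycle is not.
assert is_acyclic([(0, 1), (2, 1)], 3)
assert not is_acyclic([(0, 1), (1, 2), (2, 0)], 3)
```

A temporal self-edge C^t_i → C^{t+1}_i would be unrolled onto two distinct nodes here, matching the convention above that such paths do not count as loops.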

D.3 THEOREM 3.4 -PROOF OUTLINE

The goal of this section is to prove Theorem 3.4: the global optimum of iCITRIS will identify the minimal causal variables and their instantaneous causal graph. The proof follows a similar structure as the one Lippe et al. (2022b) used for proving the identifiability of CITRIS, but requires additional steps to integrate the possible instantaneous relations. In summary, we take the following steps:
1. (Appendix D.4) Firstly, we show that the function δ* that finds the true latent variables C_1, ..., C_K and assigns them to the corresponding sets z_{ψ_1}, ..., z_{ψ_K} constitutes a global, but not necessarily unique, optimum for maximizing the likelihood p(X^{t+1} | X^t, I^{t+1}).
2. (Appendix D.5) Next, we characterize the class of disentanglement functions Δ* which all represent a global maximum of the likelihood, i.e., achieve the same score as the true function δ*. We do this by proving that all functions in Δ* must identify the minimal causal variables.
3. (Appendix D.6) In a third step, we show that, based on the identification of the minimal causal variables, the causal graph on these learned representations must contain at least the same edges as the ground truth graph.
4. (Appendix D.7) Finally, we put all parts together and derive Theorem 3.4.
We will make use of Figure 6, which summarizes the temporal causal graph, and the notation introduced in Appendix D.1. For the remainder of the proof, we assume for simplicity of exposition that:
• The invertible map g_θ and the prior p_ϕ(z^{t+1} | z^t, I^{t+1}) are sufficiently complex to approximate any possible function and distribution one might consider in iTRIS. In practice, overparameterized neural networks can approximate most functions with sufficient accuracy.
• The sample size for the provided experimental settings is unlimited.
This ensures that dependencies and conditional independencies in the causal graph of Figure 6 are accurately reflected in the observed data distribution. We start the identifiability discussion by proving the following lemma:

Lemma D.3. The function δ* that finds the true latent variables C_1, ..., C_K and assigns them to the corresponding sets z_{ψ_1}, ..., z_{ψ_K} constitutes a global optimum of the maximum likelihood objective on p(X^{t+1} | X^t, I^{t+1}).

[Figure 6: the temporal causal graph of iTRIS, with causal variables C^{t+1}_1, ..., C^{t+1}_K connected by instantaneous edges, observations X^t, X^{t+1}, noise E^t, E^{t+1}, intervention targets I^{t+1}_1, ..., I^{t+1}_K, and their confounder R^{t+1}.]

This lemma ensures that the true model is part of the solution space of the maximum likelihood objective on p(X^{t+1} | X^t, I^{t+1}).


Proof. In order to prove this, we first rewrite the objective in terms of the true causal factors. This can be done by using the causal graph in Figure 6, which represents the true generative model:

p(X^t, X^{t+1}, C^t, C^{t+1}, I^{t+1}) = p(X^{t+1} | C^{t+1}) · \prod_{i=1}^{K} p(C^{t+1}_i | C^t, pa^{t+1}_G(C^{t+1}_i), I^{t+1}_i) · p(X^t | C^t) · p(C^t) · p(I^{t+1})

The context variable R^{t+1} is subsumed in p(I^{t+1}), since it is a confounder between the intervention targets and is independent of all other factors given I^{t+1}. In order to obtain p(X^{t+1} | X^t, I^{t+1}) from p(X^t, X^{t+1}, C^t, C^{t+1}, I^{t+1}), we need to marginalize out C^t and C^{t+1}, and condition the distribution on X^t and I^{t+1}:

p(X^{t+1} | X^t, I^{t+1}) = \int_{C^{t+1}} \int_{C^t} p(X^{t+1} | C^{t+1}) · \prod_{i=1}^{K} p(C^{t+1}_i | C^t, pa^{t+1}_G(C^{t+1}_i), I^{t+1}_i) · p(C^t | X^t) \, dC^t \, dC^{t+1}

In the assumptions with respect to the observation function h, we have defined h to be bijective, meaning that there exists an inverse h^{-1} that can identify the causal factors C^t and noise variables E^t from X^t. Using this invertible map, we can write p(C^t | X^t) = δ(h^{-1}(X^t) = [C^t; ·]), where δ is a Dirac delta. We also remove E^t from the conditioning set since it is independent of X^{t+1}. This leads us to:

p(X^{t+1} | X^t, I^{t+1}) = \int_{C^{t+1}} \prod_{i=1}^{K} p(C^{t+1}_i | C^t, pa^{t+1}_G(C^{t+1}_i), I^{t+1}_i) · p(X^{t+1} | C^{t+1}) \, dC^{t+1}

We can use a similar step to relate X^{t+1} with C^{t+1} and E^{t+1}. However, since we model a distribution over X^{t+1}, we need to respect possible non-volume-preserving transformations.
Hence, we use the change of variables formula with the Jacobian J_h = ∂h(C^{t+1}, E^{t+1}) / ∂[C^{t+1}; E^{t+1}] of the observation function h to obtain:

p(X^{t+1} | X^t, I^{t+1}) = |J_h|^{-1} · \prod_{i=1}^{K} p(C^{t+1}_i | C^t, pa^{t+1}_G(C^{t+1}_i), I^{t+1}_i) · p(E^{t+1})    (30)

Since Equation (30) is a derivation of the true generative model p(X^t, X^{t+1}, C^t, C^{t+1}, I^{t+1}), it constitutes a global optimum of the maximum likelihood. Hence, one cannot achieve higher likelihoods by reparameterizing the causal factors or using a different graph, as long as the graph is directed and acyclic. In the next step, we relate this maximum likelihood solution to iCITRIS, more specifically, the prior of iCITRIS. For this setting, the learnable, invertible map g_θ is identical to the inverse of the observation function, h^{-1}. In terms of the latent variable prior, we have defined our objective of iCITRIS as:

p_ϕ(z^{t+1} | z^t, I^{t+1}) = \prod_{i=0}^{K} p_ϕ(z^{t+1}_{ψ_i} | z^t, z^{t+1}_{ψ_{pa_i}}, I^{t+1}_i)    (31)

Since we know that g*_θ is an invertible function between X and Z, we know that z^t must include all information of X^t. Thus, we can also replace it with z^t = [C^t, E^t], giving us:

p_ϕ(z^{t+1} | C^t, E^t, I^{t+1}) = \prod_{i=0}^{K} p_ϕ(z^{t+1}_{ψ_i} | C^t, E^t, z^{t+1}_{ψ_{pa_i}}, I^{t+1}_i)    (32)

Next, we consider the assignment function ψ*. The optimal assignment function ψ* assigns sufficient dimensions to each causal factor C_1, ..., C_K, such that we can consider z^{t+1}_{ψ*_i} = C^{t+1}_i for i = 1, ..., K. Further, the same graph G is used in the latent space as in the ground truth, except that we additionally condition z_{ψ*_i}, i = 1, ..., K, on z_{ψ*_0}. With that, Equation (32) becomes:

p_ϕ(z^{t+1} | C^t, E^t, I^{t+1}) = \prod_{i=1}^{K} p_ϕ(z^{t+1}_{ψ*_i} = C^{t+1}_i | C^t, z^{t+1}_{ψ_{pa_i}}, z^{t+1}_{ψ*_0}, I^{t+1}_i) · p(z^{t+1}_{ψ*_0} | C^t, E^t)    (33)

where we remove E^t from the conditioning set for the causal factors, since we know that C^{t+1} and E^{t+1} are independent of E^t.
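A toy instance of the factorized prior in Equation (31) can make the role of the intervention target concrete. The sketch below assumes a two-block linear-Gaussian parameterization (z_1 → z_2 instantaneously; I_2 cuts the parents of z_2); all coefficients are illustrative assumptions, not the paper's model:

```python
import math

def gauss_logpdf(x, mu, var):
    # log-density of N(x | mu, var)
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def prior_logprob(z1, z2, z1_prev, z2_prev, I2):
    # p(z1^{t+1} | z^t): temporal parent only (z1 has no instantaneous parents)
    lp1 = gauss_logpdf(z1, 0.5 * z1_prev, 1.0)
    # p(z2^{t+1} | z^t, z1^{t+1}, I2): the intervention removes both the
    # temporal and the instantaneous parents of z2
    mu2 = 0.0 if I2 else 0.3 * z2_prev + 0.8 * z1
    return lp1 + gauss_logpdf(z2, mu2, 1.0)

# Under I2 = 1, the z2 factor no longer depends on its parents: subtracting
# the z1 contribution leaves an identical remainder for any value of z1.
a = prior_logprob(1.0, 0.4, 0.2, 0.6, I2=1) - gauss_logpdf(1.0, 0.1, 1.0)
b = prior_logprob(2.0, 0.4, 0.2, 0.6, I2=1) - gauss_logpdf(2.0, 0.1, 1.0)
assert abs(a - b) < 1e-12
```

The product over blocks mirrors the factorization in Equation (31), with the intervention flag switching each factor between its observational and interventional mechanism.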
Now, z_{ψ*_0} must summarize all information of z^{t+1} which is not modeled in the causal graph. Thus, z_{ψ*_0} represents the noise variables: z^{t+1}_{ψ*_0} = E^{t+1}.

p_ϕ(z^{t+1} | C^t, E^t, I^{t+1}) = \prod_{i=1}^{K} p_ϕ(z^{t+1}_{ψ*_i} = C^{t+1}_i | C^t, z^{t+1}_{ψ_{pa_i}}, z^{t+1}_{ψ*_0}, I^{t+1}_i) · p(z^{t+1}_{ψ*_0} = E^{t+1} | C^t, E^t)    (34)

Finally, by using g*_θ, we can replace the distribution over z^{t+1} by a distribution over X^{t+1} via the change of variables formula:

p_ϕ(X^{t+1} | C^t, E^t, I^{t+1}) = \left| \frac{∂g*_θ(z^{t+1})}{∂z^{t+1}} \right| · \prod_{i=1}^{K} p_ϕ(z^{t+1}_{ψ*_i} = C^{t+1}_i | C^t, z^{t+1}_{ψ_{pa_i}}, z^{t+1}_{ψ*_0}, I^{t+1}_i) · p(z^{t+1}_{ψ*_0} = E^{t+1} | C^t, E^t)    (35)

We can simplify this distribution by using the independencies of the noise term E^{t+1} in the causal graph of Figure 6:

p_ϕ(X^{t+1} | C^t, E^t, I^{t+1}) = \left| \frac{∂g*_θ(z^{t+1})}{∂z^{t+1}} \right| · \prod_{i=1}^{K} p_ϕ(z^{t+1}_{ψ*_i} = C^{t+1}_i | C^t, z^{t+1}_{ψ_{pa_i}}, I^{t+1}_i) · p(z^{t+1}_{ψ*_0} = E^{t+1})    (36)

With this, Equation (36) represents the exact same distribution as Equation (30). Therefore, we have shown that the function δ* that identifies the true latent variables C_1, ..., C_K and assigns them to the corresponding sets z_{ψ_1}, ..., z_{ψ_K} constitutes a global optimum for maximizing the likelihood. However, this solution is not necessarily unique, and additional optima may exist. In the next steps of the proof, we will discuss the class of functions and graphs that lead to the same optimum.

D.5 THEOREM 3.4 - PROOF STEP 2: IDENTIFIABILITY OF THE CAUSAL VARIABLES

In this section, we discuss the identifiability results of the causal variables in iCITRIS. We first describe the minimal causal variables in iTRIS, and how they differ from those in TRIS as used by CITRIS (Lippe et al., 2022b). Next, we identify the information that must be assigned to the individual parts of the latent representation. Finally, we discuss the final setup to ensure identification of the variables according to Definition 3.3, including the additional variables in z_{ψ_0}.

D.5.1 MINIMAL CAUSAL VARIABLES

Lippe et al.
(2022b) introduced the concept of a minimal causal variable as an invertible split of a causal variable s_i(C_i) = (s^{var}(C_i), s^{inv}(C_i)) into one part that strictly depends on the intervention, s^{var}(C_i), and a part that is independent of it, s^{inv}(C_i) (see Definition 2.1). In other words, the minimal causal variable is the smallest part of a causal variable that strictly depends on the provided intervention.

[Figure 7: (a) the original causal graph of C_i, with C^t, pa^{t+1}(C^{t+1}_i), and I^{t+1}_i as parents of C^{t+1}_i; (b) the minimal causal split graph of C_i, in which C^{t+1}_i is split into s^{var}_i(C^{t+1}_i) and s^{inv}_i(C^{t+1}_i).]

For iCITRIS, we consider the same concept, but adapt it to the setup of iTRIS. First, iTRIS assumes the presence of interventions that render a variable independent of its instantaneous parents. Hence, under these interventions, we can ensure that s^{inv}(C_i) does not have any instantaneous parents. Second, the presence of a causal graph in iCITRIS allows dependencies between different parts of the latent space. Further, z_{ψ_0} can be the parent of any other set of variables, thus allowing for potential dependencies between s^{inv}(C_i) and s^{var}(C_i). Note that such dependencies within the same time step, however, must also be cut off by the intervention. Hence, the split s_i(C^t_i) = (s^{var}_i(C^t_i), s^{inv}_i(C^t_i)) must have the following distribution structure:

p(s_i(C^{t+1}_i) | C^t, pa^{t+1}(C^{t+1}_i), I^{t+1}_i) = p(s^{var}_i(C^{t+1}_i) | C^t, pa^{t+1}(C^{t+1}_i), s^{inv}_i(C^{t+1}_i), I^{t+1}_i) · p(s^{inv}_i(C^{t+1}_i) | C^t)    (37)

where

p(s^{var}_i(C^{t+1}_i) | C^t, pa^{t+1}(C^{t+1}_i), s^{inv}_i(C^{t+1}_i), I^{t+1}_i) = \begin{cases} p(s^{var}_i(C^{t+1}_i) | C^t) & \text{if } I^{t+1}_i = 1 \\ p(s^{var}_i(C^{t+1}_i) | C^t, pa^{t+1}(C^{t+1}_i), s^{inv}_i(C^{t+1}_i)) & \text{otherwise} \end{cases}    (38)

Thereby, the minimal causal variable with respect to its intervention variable I^{t+1}_i is the split s_i which maximizes the information content H(s^{inv}_i(C^t_i) | C^t). These relations are visualized in Figure 7.
Causal variables for which the intervention target is constant, i.e., for which no interventions have been observed, were modeled by s^{inv}(C_i) = C_i, s^{var}(C_i) = ∅ in CITRIS (Lippe et al., 2022b). Here, this does not naturally hold anymore, since s^{inv}(C_i) is restricted to not having any instantaneous parents. However, as stated in assumption 1, a variable without interventions cannot be an instantaneous child of any variable. Hence, for a causal variable C_i, if I^t_i = 0 for all t, its minimal causal split is defined as s^{inv}(C_i) = C_i, s^{var}(C_i) = ∅, as in CITRIS (Lippe et al., 2022b).
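The case distinction in Equation (38) can be sketched as a toy conditional mean; the linear coefficients below are illustrative assumptions, not the paper's parameterization:

```python
def s_var_mean(c_prev, pa_inst, s_inv, intervened):
    """Toy conditional mean of s_var_i per Equation (38): under an intervention,
    the instantaneous parents and s_inv_i are cut away; otherwise they enter."""
    base = 0.3 * c_prev                       # the temporal parents C^t always remain
    if intervened:
        return base
    return base + 0.5 * sum(pa_inst) + 0.2 * s_inv

# With an intervention, the mean depends on the previous time step only;
# without one, the instantaneous parents and s_inv contribute as well.
assert s_var_mean(1.0, [2.0], 0.5, intervened=True) == 0.3
assert abs(s_var_mean(1.0, [2.0], 0.5, intervened=False) - 1.4) < 1e-12
```

A sampler for s^{var}_i would add Gaussian noise around this mean, switching between the two branches based on I^{t+1}_i exactly as in Equation (38).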

D.5.2 IDENTIFYING THE MINIMAL CAUSAL VARIABLES

As a first step, we postulate the following lemma:

Lemma D.4. For all representation functions in the class Δ*, there exists a deterministic map from the latent representation z_{ψ_i} to the minimal causal variable s^{var}(C_i) for all causal variables C_i, i = 1, ..., K.

This lemma intuitively states that the minimal causal variable s^{var}(C_i) is modeled in the latent representation z_{ψ_i} for any representation that maximizes the likelihood objective. Note that this does not imply exclusive modeling yet, meaning that z_{ψ_i} can contain more information than just s^{var}(C_i). We will discuss this aspect in Appendix D.5.3.

Proof. In order to prove this lemma, we first review some relations between the conditional and joint entropy. Consider two random variables A, B of arbitrary space and dimension. The conditional entropy between these two random variables is defined as H(A|B) = H(A, B) - H(B) (Cover et al., 2005). Further, the maximum of the joint entropy is the sum of the individual entropy terms, H(A, B) ≤ H(A) + H(B) (Cover et al., 2005). Hence, we get that H(A|B) = H(A, B) - H(B) ≤ H(A) + H(B) - H(B) = H(A). In other words, the entropy of a random variable A can only become lower when conditioned on any other random variable B. Using this relation, we now move to identifying the minimal causal variables. If a minimal causal variable is the empty set, i.e., s^{var}(C_i) = ∅, for instance due to not having observed interventions on C_i, the lemma is already true by construction, since no information must be modeled in z_{ψ_i}. Thus, we can focus on cases where s^{var}(C_i) ≠ ∅, which implies that C^{t+1}_i is not independent of I^{t+1}_i. Therefore, the following inequality must strictly hold: H(C^{t+1}_i | C^t, C^{t+1}_{-i}, I^{t+1}_i) < H(C^{t+1}_i | C^t, C^{t+1}_{-i}) for all i = 1, ..., K.
Additionally, based on the assumption that the observational and interventional distributions share the same support, we know that the intervention posterior, i.e., p(I^{t+1} | X^{t+1}), cannot be deterministic for any data point X^{t+1} and intervention target I^{t+1}_i. Thus, we cannot derive I^{t+1}_i from the observation X^{t+1} alone. Thirdly, because every latent variable is conditioned on exactly one intervention target in iCITRIS, and there exists no deterministic function between any pair of intervention targets, one cannot identify I^{t+1}_i in any latent variables except z_{ψ_i}. Therefore, the only way in iCITRIS to fully exploit the information of the intervention target I^{t+1}_i is to model its dependent information in z_{ψ_i}. As this information corresponds to the minimal causal variable s^{var}(C_i), any representation function must model the distribution p(s^{var}(C_i) | ...) in p(z_{ψ_i} | I^{t+1}_i, ...) to achieve the maximum likelihood solution. This holds independently of the modeled causal graph structure, meaning that if there exist representation functions with different graphs in Δ*, then all of them must model s^{var}(C_i) in z_{ψ_i}. Finally, using assumption 4 (Appendix D.2.4), we obtain that this distributional relation implies that s^{var}(C_i) in z_{ψ_i} is functionally independent of any other latent variable. Thus, there exists a deterministic map from z_{ψ_i} to s^{var}(C_i) in any of the maximum likelihood solutions.
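The entropy identities used at the start of the proof can be verified on a small discrete joint distribution; the sample set below is an arbitrary illustrative example:

```python
import math
from collections import Counter

def entropy(counts):
    # Shannon entropy (in bits) of an empirical distribution given by counts
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Joint samples of (A, B) where B carries information about A.
samples = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]
H_A = entropy(Counter(a for a, _ in samples))
H_B = entropy(Counter(b for _, b in samples))
H_AB = entropy(Counter(samples))
H_A_given_B = H_AB - H_B           # chain rule: H(A|B) = H(A,B) - H(B)

# Conditioning can only lower the entropy: H(A|B) <= H(A).
assert H_A_given_B <= H_A + 1e-12
```

With these samples, H(A) = 1 bit while H(A|B) ≈ 0.81 bits, so conditioning on B strictly reduces the entropy of A, mirroring the strict inequality used for intervened variables.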

D.5.3 DISENTANGLING THE MINIMAL CAUSAL VARIABLES

The previous subsection showed that z_{ψ_i} models the minimal causal variable s^{var}(C_i). This, however, is not necessarily the only information in z_{ψ_i}. For instance, for two random variables A, B ∈ ℝ, the following distributions are identical:

p(A) · p(B|A) = p(A) · p(B + A|A) = p(A) · p(B, A|A)

The second distribution can arbitrarily add information about A to B without changing the likelihoods. This is because the distribution is conditioned on A, and the conditional entropy of a random variable with itself is H(A|A) = H(A, A) - H(A) = H(A) - H(A) = 0. Hence, for arbitrary autoregressive distributions, we cannot identify the variables from each other purely by looking at the likelihoods.

[Figure 8: the instantaneous graph over C_1, C_2, C_3 under (a) the observational regime and under interventions on (b) C_1, (c) C_2, and (d) C_3, where each intervention removes the incoming edges of the intervened variable.]

However, in iTRIS, we are given interventions under which variables are strictly independent of their instantaneous parents. With this, we postulate the following lemma:

Lemma D.5. For all representation functions in the class Δ*, z_{ψ_i} does not contain information about any other minimal causal variable s^{var}(C_j), j ≠ i, besides s^{var}(C_i), i.e., H(z_{ψ_i} | s^{var}(C_i)) = H(z_{ψ_i} | s^{var}(C_i), s^{var}(C_j)).

Proof. In order to prove this lemma, we consider all augmented graph structures that are induced by the provided interventions on the instantaneous causal graph. Specifically, given a graph G = (V, E) with V being its vertices and E its edges, and a set of binary intervention targets I = {I_1, ..., I_{|V|}}, we construct an augmented DAG G' = (V', E'), where V' = V and E' = E \ {pa_G(V_i) → V_i | i = 1, ..., |V|, I_i = 1}. In other words, the augmented graph G' has all edges into intervened variables removed. An example of a graph of three variables and its three single-target interventions is shown in Figure 8.
A representation function in the class Δ* must model the optimal likelihood for all intervention-augmented graphs of its originally learned graph Ĝ, since it cannot achieve a lower likelihood than the ground truth for any of the graphs. For every pair of variables C_i, C_j, assumption 2 (Appendix D.2.2) ensures that we observe one of three possible experiment sets: (1) (I^t_i = 1, I^t_j = 0) and (I^t_i = 0, I^t_j = 1); (2) (I^t_i = 0, I^t_j = 0), (I^t_i = 1, I^t_j = 0), and (I^t_i = 1, I^t_j = 1); or (3) (I^t_i = 0, I^t_j = 0), (I^t_i = 0, I^t_j = 1), and (I^t_i = 1, I^t_j = 1). In all cases, there exists at least one augmented graph in which C^t_i ⊥⊥ C^t_j, and hence z^t_{ψ_i} ⊥⊥ z^t_{ψ_j}, must hold: the sets (2) and (3) contain joint interventions on both variables (I^t_i = 1, I^t_j = 1), and in (1), a persistent connection between the two variables would require both edges C_i → C_j and C_j → C_i to be present in the graph, which implies a cycle and violates our acyclicity assumption 5. Under the augmented graph where z^t_{ψ_i} ⊥⊥ z^t_{ψ_j}, the optimal likelihood can only be achieved if the distribution of z^t_{ψ_i} is actually independent of z^t_{ψ_j}, thus not containing any information about s^{var}(C_j). The same holds for z_{ψ_j}. Hence, a representation function in the class Δ* must identify the minimal causal variables in the latent space.
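The augmented-graph construction from the proof is direct to write down. A small sketch with the Figure 5 graph C_1, C_3 → C_2 (0-indexed nodes):

```python
def augment(edges, targets):
    """Build the intervention-augmented graph: remove all incoming edges of
    intervened nodes, i.e., E' = E \\ {pa(V_i) -> V_i | I_i = 1}."""
    return [(i, j) for (i, j) in edges if not targets[j]]

# Graph C_1 -> C_2 <- C_3 (nodes 0, 1, 2) under different intervention targets.
edges = [(0, 1), (2, 1)]
assert augment(edges, {0: 0, 1: 1, 2: 0}) == []       # intervening on C_2 cuts both edges
assert augment(edges, {0: 1, 1: 0, 2: 0}) == edges    # C_1 has no incoming edges to cut
```

In the augmented graph where C_2 is intervened, C_2 becomes independent of C_1 and C_3, which is exactly the independence the likelihood must respect.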

D.5.4 DISENTANGLING THE REMAINING VARIABLES

In Appendix D.5.2 and Appendix D.5.3, we have shown that for any solution in the class Δ*, z_{ψ_i} models the minimal causal variable s^{var}(C_i), and no other. Still, there exist more dimensions that need to be modeled: the causal variables without interventions, the invariant parts of the causal variables, s^{inv}(C_i), as well as the noise variables E^t, are all part of the generative model that influences an observation X^t. All these variables share the property that they are not instantaneous children of any minimal causal variable, and can only be parents of them. This leads to the situation that any of these variables could be modeled in the latent representation z_{ψ_i} for an arbitrary i = 1, ..., K, as long as C_i is the parent of the same variables. The reason for this is that the distribution modeling of such variables is independent of the interventions.

Published as a conference paper at ICLR 2023

To exclude them from the causal variable modeling, we follow the same strategy as in CITRIS (Lippe et al., 2022b) by taking the representation function that maximizes the entropy of z_{ψ_0}:

Lemma D.6. For all representation functions in the class Δ* that maximize the information content of p(z_{ψ_0} | C^t) according to LDDP, the latent representation z_{ψ_i} exclusively models the minimal causal variable s^{var}(C_i) for all causal variables C_i, i = 1, ..., K.

Proof. Using Lemma D.4 and Lemma D.5, we know that the only remaining information besides the minimal causal variables consists of the causal variables without interventions, the invariant parts of the causal variables, s^{inv}(C_i), as well as the noise variables E^t. As assumption 1 (Appendix D.2.1) states, all these variables cannot be children of the observed, intervened variables. Thus, the remaining information M = {s^{inv}(C_1), ..., s^{inv}(C_K), E^t} can be optimally modeled by p(M | z^t) p(z_{ψ_1}, ..., z_{ψ_K} | M, z^t, I^{t+1}). This implies that there exists a solution where z_{ψ_0} = M, which can be found by searching for the solution with the maximum entropy of p(z_{ψ_0} | C^t). In this solution, the latent representations z_{ψ_1}, ..., z_{ψ_K} do not model any subset of M, hence modeling the minimal causal variables exclusively. The overall result is that we identify the minimal causal variables in z_{ψ_1}, ..., z_{ψ_K}, and all remaining information is modeled in z_{ψ_0}. Note that the causal variables without interventions, the noise variables, and the invariant parts of the causal variables can be arbitrarily entangled in z_{ψ_0}. Furthermore, since there exist variables in z_{ψ_0} that may not have any temporal parents (e.g., the noise variables and the invariant parts of the intervened causal variables), we cannot rely on assumption 4 (Appendix D.2.4) to ensure functional independence.
Hence, while the distribution p(z_{ψ_0} | z^t) is independent of z_{ψ_1}, ..., z_{ψ_K}, there may exist dependencies such that, for a single data point, a change in z_{ψ_i} can result in a change of the noise or invariant parts of the causal variables in the observational space.

D.6 THEOREM 3.4 -PROOF STEP 3: IDENTIFIABILITY OF THE CAUSAL GRAPH

In this step of the proof, we discuss the identifiability of the causal graph under the previous findings. In the first subsection, we discuss what graph we can optimally find under the identification of the minimal causal variables. In the second part, we then show how the maximum likelihood objective is sufficient for identifying the instantaneous causal graph. Finally, we discuss the identifiability of the temporal causal graph.

D.6.1 CAUSAL GRAPH ON MINIMAL CAUSAL VARIABLES

The identification of the causal graph naturally depends on the learned latent representations of the causal variables. In Appendix D.5, we have shown that one can only guarantee to find the minimal causal variables in iTRIS. Thus, we are limited to finding the causal graph on the minimal causal variables s^{var}(C_1), s^{var}(C_2), ..., s^{var}(C_K) and the additional variables modeled in z_{ψ_0}. The graph between the minimal causal variables is not necessarily equal to the ground truth graph. For instance, consider a 2-dimensional position (x, y) and the color of an object as two causal variables. If the x-position causes the color, but the minimal causal variable of the position is only s^{var}(C_1) = y, then the color only has s^{inv}(C_1) as a parent, not s^{var}(C_1). In the learned graph on the latent representation, this would mean that we do not have an edge between z_{ψ_1} and z_{ψ_2}, but instead z_{ψ_0} → z_{ψ_2}. Hence, we might have a mismatch between the ground truth graph on the full causal variables and the graph on the modeled minimal causal variables. Still, there are patterns and guarantees one can give for what the optimal learned graph looks like. Due to the nature of the interventions, the invariant part of a causal variable, s^{inv}(C_i), cannot have any instantaneous parents. Thus, the instantaneous parents of a minimal causal variable s^{var}(C_i) correspond to the same ground truth causal variables as in the true graph, i.e., pa(C_i) = pa(s^{var}(C_i)). The difference lies in how the parents are represented. Since each parent C_j ∈ pa(C_i) is split into a variant and an invariant part, any combination of the two can represent a parent of s^{var}(C_i). Thus, the learned set of parents for s^{var}(C_i), i.e., pa(z_{ψ_i}), must be a subset of {s^{var}(C_j) | C_j ∈ pa(C_i)} ∪ {z_{ψ_0}}.
This implies that if there is no causal edge between two causal variables C_i and C_j in the ground truth causal graph, then there is also no edge between their minimal causal variables s^{var}(C_i) and s^{var}(C_j). The causal graph between the true variables and the graph between the minimal causal variables therefore share many similarities and, in practice, are often almost identical.

[Figure 9: (a) the causal graph C_1 → C_2, and the minimal sets of experiments, i.e., unique combinations of I_1, I_2 in the dataset, that guarantee the intervention targets to be distinct, i.e., not ∀t, I^t_1 = I^t_2: (b) experimental setting 1 with E_0: (I_1 = 1, I_2 = 0) and E_1: (I_1 = 0, I_2 = 1); (c) experimental setting 2 with E_0: (0, 0), E_1: (1, 0), E_2: (1, 1); (d) experimental setting 3 with E_0: (0, 0), E_1: (0, 1), E_2: (1, 1). Under each of these sets of experiments, the maximum likelihood solution of p(C_1, C_2 | I_1, I_2) uniquely identifies the causal orientation.]

The additional latent variables z_{ψ_0} summarize all invariant parts of the intervened variables, the remaining causal variables without interventions, and the noise variables. Therefore, z_{ψ_0} cannot be an instantaneous child of any minimal causal variable, and we can predefine the orientation of those edges in the instantaneous graph. Next, we can discuss the identifiability guarantees for the graph on the minimal causal variables. For simplicity, in the rest of the section, we refer to identifying the causal graph on the minimal causal variables as identifying the graph on C_1, ..., C_K.

D.6.2 OPTIMIZING THE MAXIMUM LIKELIHOOD OBJECTIVE UNIQUELY IDENTIFIES THE INSTANTANEOUS CAUSAL GRAPH UNDER INTERVENTIONS

Several causal discovery works have shown that causal graphs can be identified given sufficient interventions (Brouillard et al., 2020; Eberhardt, 2007; Lippe et al., 2022a; Pearl, 2009). Since the identification of the causal variables already requires interventions that render variables independent of their instantaneous parents, we can exploit these interventions for learning and identifying the graph as well. In assumption 5 (Appendix D.2.5), we have assumed that the causal graph to identify is faithful. This implies that any dependency between two variables C_1, C_2 which have a causal relation among them (C_1 → C_2 or C_2 → C_1) cannot be replaced by conditioning C_1 and/or C_2 on other variables. In other words, in order to optimize the overall likelihood p(C_1, ..., C_K), we require a graph that has a causal edge between two variables whenever they are causally related. Now, we are interested in whether we can identify the orientation between every pair of causal variables that have a causal relation in the ground truth graph, which leads us to the following lemma:

Lemma D.7. In iTRIS, the orientation of an instantaneous causal effect between two causal variables C_i, C_j can be identified by solely optimizing the likelihood of p(C_i, C_j | I_i, I_j).

Proof. To discuss the identifiability of the causal direction between two variables C_1, C_2, we need to consider all possible minimal sets of experiments that fulfill the intervention setup of assumption 2 (Appendix D.2.2). These three sets are shown in Figure 9. For all three sets, we have to show that the maximum likelihood of the conditional distribution p(C_1, C_2 | I_1, I_2) can only be achieved by modeling the correct orientation, here C_1 → C_2. For cases where the true graph is C_2 → C_1, the same argumentation holds with the variable names C_1 and C_2 swapped.
As an overview, Table 3 shows the distribution p(C_1, C_2 | I_1, I_2) under all possible experiments and causal graphs.

Experimental setting 1 (Figure 9b): In the first experimental setting, we are given single-target interventions on C_1 and C_2. In the experiment E_0, which represents interventions on C_1 with passive observations of C_2, the dependency between C_1 and C_2 persists in the ground truth, i.e., C_1 and C_2 remain dependent given I_1 = 1, I_2 = 0. Hence, only causal graphs that condition C_2 on C_1 under interventions on C_1 can achieve the maximum likelihood in E_0. From Table 3, we see that the only causal graph that does this is C_1 → C_2. Thus, when single-target interventions on C_1 are observed, we can uniquely identify the orientation of its outgoing edges.

Experimental setting 2 (Figure 9c): The second experimental setting provides the observational regime (E_0), interventions on C_1 with C_2 being passively observed (E_1), and joint interventions on

Table 3: The probability distribution p(C_1, C_2 | I_1, I_2) for all possible causal graphs among the two causal variables C_1, C_2 under different experimental settings. Observational distributions are denoted with p(...), and interventional ones with \tilde{p}(...). Note that under interventions, it is enforced that \tilde{p}(...) is not conditioned on any parents, since we work on the instantaneous graph.


Interventions    Causal graph
I_1  I_2    C_1 → C_2                        C_2 → C_1                        C_1 ⊥⊥ C_2
0    0      p(C_1) p(C_2 | C_1)              p(C_2) p(C_1 | C_2)              p(C_1) p(C_2)
1    0      \tilde{p}(C_1) p(C_2 | C_1)      p(C_2) \tilde{p}(C_1)            \tilde{p}(C_1) p(C_2)
0    1      p(C_1) \tilde{p}(C_2)            \tilde{p}(C_2) p(C_1 | C_2)      p(C_1) \tilde{p}(C_2)
1    1      \tilde{p}(C_1) \tilde{p}(C_2)    \tilde{p}(C_2) \tilde{p}(C_1)    \tilde{p}(C_1) \tilde{p}(C_2)

C_1 and C_2 (E_2). Since the experiment E_1 gives us the same setup as in experimental setting 1, we can directly conclude that the causal orientation C_1 → C_2 is again identifiable.

Experimental setting 3 (Figure 9d): In the final experimental setting, C_1 is only observed to be jointly intervened upon together with C_2, not allowing for the same argument as in experimental settings 1 and 2. However, the causal graph remains identifiable for the following reasons. Firstly, the experiment E_0 with its purely observational regime cannot be optimally modeled by a causal graph without an edge between C_1 and C_2, reducing the set of possible causal graphs to C_1 → C_2 and C_2 → C_1. Under the joint interventions of E_2, both causal graphs model the same distribution. Still, under the experiment E_1, where only C_2 has been intervened upon, the two distributions differ. The graph with the anti-causal orientation compared to the true graph, C_2 → C_1, uses the same distribution as in the observational regime to model C_1, i.e., p(C_1 | C_2). In order for this to achieve the same likelihood as the true orientation, it would need to be conditioned on I_2, as the following derivation from the true distribution p(C_1, C_2 | I_1, I_2) shows:

p(C_1, C_2 | I_1, I_2) = p(C_2 | I_1, I_2) · p(C_1 | C_2, I_1, I_2)    (41)

p(C_1 | C_2, I_1, I_2) = \begin{cases} p(C_1 | I_1) & \text{if } I_2 = 1 \\ p(C_1 | C_2, I_1) & \text{if } I_2 = 0 \end{cases}    (42)

This derivation shows that p(C_1 | C_2, I_1, I_2) strictly depends on I_2 if p(C_1 | C_2, I_1, I_2 = 1) ≠ p(C_1 | C_2, I_1, I_2 = 0), which is ensured by C_1 and C_2 not being conditionally independent in the ground truth graph.
As the causal graph C_2 → C_1 models C_1 independently of I_2, it therefore cannot achieve the maximum likelihood solution in this experimental setting. Hence, the only graph achieving the maximum likelihood solution is C_1 → C_2, such that the orientation can again be uniquely identified. All other possible experimental settings must contain one of the three previously discussed experiments as a subset, due to assumption 2 (Appendix D.2.2). Hence, we have shown that for all valid experimental settings, optimizing the maximum likelihood objective uniquely identifies the causal orientations between pairs of variables under interventions. Based on these orientations, we can exclude all additional edges that could introduce a cycle in the graph, since we strictly require an acyclic graph. The only remaining non-identified parts of the graph are edges among variables that are independent conditioned on their parents. In terms of maximum likelihood, these edges do not influence the objective, since for two variables C_1, C_2 with C_1 ⊥⊥ C_2, p(C_1) · p(C_2) = p(C_1 | C_2) · p(C_2) = p(C_1) · p(C_2 | C_1). Hence, the equivalence class in terms of maximum likelihood includes all graphs that contain at least the true edges and are acyclic. By requiring structural minimality, i.e., taking the graph with the fewest edges that fully describes the distribution, we can therefore identify the full causal graph between C_1, ..., C_K.
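The distinguishing prediction of this argument can be checked numerically: under an intervention on C_1 alone, the dependence between C_1 and C_2 persists only if C_1 → C_2. Below is a minimal simulation, assuming linear-Gaussian mechanisms purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Ground truth C1 -> C2 under I1 = 1, I2 = 0: C1 is drawn from the
# interventional distribution, but the mechanism p(C2 | C1) stays intact,
# so the dependence between C1 and C2 persists.
c1 = rng.normal(0.0, 1.0, n)                  # interventional p~(C1)
c2 = 0.8 * c1 + rng.normal(0.0, 0.6, n)       # observational p(C2 | C1)
dep_true = np.corrcoef(c1, c2)[0, 1]

# Anti-causal graph C2 -> C1 under the same intervention: the intervention
# cuts C1 off from its (hypothesised) parent C2, so both become independent.
c2_anti = rng.normal(0.0, 1.0, n)             # p(C2)
c1_anti = rng.normal(0.0, 1.0, n)             # interventional p~(C1)
dep_anti = np.corrcoef(c1_anti, c2_anti)[0, 1]

print(abs(dep_true) > 0.5, abs(dep_anti) < 0.05)  # True True
```

Only the true orientation reproduces the dependence C_1 ⊥̸⊥ C_2 | I_1 = 1, I_2 = 0, matching row two of Table 3.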

D.6.3 IDENTIFYING THE TEMPORAL CAUSAL RELATIONS BY PRUNING EDGES

So far, we have shown that the instantaneous causal relations between the minimal causal variables can be identified. Besides the instantaneous graph, there also exist temporal relations between C^t and C^{t+1}, which we also aim to identify:

Published as a conference paper at ICLR 2023

Lemma D.8. In iCITRIS, the temporal causal graph between the minimal causal variables can be identified by removing the edge between any pair of variables z^t_ψi, z^{t+1}_ψj with i, j ∈ 0..K, if z^t_ψi ⊥⊥ z^{t+1}_ψj | z^t_ψ-i, pa^{t+1}(z^{t+1}_ψj).

Proof. The prior in Equation (2) conditions the latent variables z^{t+1} on all variables of the previous time step, z^t. Thus, this corresponds to modeling a fully connected graph from z^t_ψ0, z^t_ψ1, ..., z^t_ψK to z^{t+1}_ψ0, z^{t+1}_ψ1, ..., z^{t+1}_ψK. Since any temporal edge must be oriented from z^t to z^{t+1}, it is clear that the true temporal graph, G_T, must be a subset of this graph. Further, since in assumption 5 (Appendix D.2.5) we have stated that the true causal model is faithful, we know that two variables z^t_ψi and z^{t+1}_ψj are only connected by an edge if they are not conditionally independent of each other: z^t_ψi ⊥̸⊥ z^{t+1}_ψj | z^t_ψ-i, pa^{t+1}(z^{t+1}_ψj). This implies that all redundant edges must be between two conditionally independent variables with z^t_ψi ⊥⊥ z^{t+1}_ψj | pa^t(z^{t+1}_ψj), pa^{t+1}(z^{t+1}_ψj), where pa^t(z^{t+1}_ψj) is a subset of z^t_ψ-i. Thus, we can find the true temporal graph by iterating through all pairs of variables z^t_ψi and z^{t+1}_ψj, and removing the edge if the two are conditionally independent given z^t_ψ-i, pa^{t+1}(z^{t+1}_ψj).
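The pruning rule of Lemma D.8 can be sketched in a small linear-Gaussian example, where conditional independence is tested via partial correlation; the linear test and the variable names are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000

# True temporal graph: b^t -> b^{t+1} only; a^t is a noisy copy of b^t.
# The fully connected candidate graph also contains a^t -> b^{t+1},
# which should be pruned because a^t is independent of b^{t+1} given b^t.
b = np.zeros(T)
for t in range(T - 1):
    b[t + 1] = 0.9 * b[t] + rng.normal(0.0, 0.5)
a = b + rng.normal(0.0, 0.5, T)
a_t, b_t, b_next = a[:-1], b[:-1], b[1:]

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z."""
    zc = np.column_stack([z, np.ones_like(z)])
    rx = x - zc @ np.linalg.lstsq(zc, x, rcond=None)[0]
    ry = y - zc @ np.linalg.lstsq(zc, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

marginal = np.corrcoef(a_t, b_next)[0, 1]   # large: a^t and b^{t+1} correlate
partial = partial_corr(a_t, b_next, b_t)    # ~0: prune the edge a^t -> b^{t+1}
print(abs(marginal) > 0.5, abs(partial) < 0.05)  # True True
```

The marginal dependence alone would keep the redundant edge; conditioning on the true parents removes it, as the lemma requires.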

D.7 THEOREM 3.4 -PROOF STEP 4: FINAL IDENTIFIABILITY RESULT

Using the results derived in Appendix D.4, Appendix D.5, and Appendix D.6, we are finally able to derive the full identifiability result. In Appendix D.5, we have shown that any solution that maximizes the likelihood p_ϕ,θ,G(x^{t+1} | x^t, I^{t+1}) identifies the minimal causal variables of C_1, ..., C_K in z_ψ1, ..., z_ψK. Further, we are able to summarize all remaining variables in z_ψ0 by maximizing the entropy (LDDP) of p_ϕ(z^{t+1}_ψ0 | z^t). In Appendix D.6, we have used this disentanglement condition to show that the causal graph that maximizes the likelihood must have at least the same edges as the ground truth graph on the minimal causal variables. To obtain the full ground truth graph, we need to pick the graph with the fewest edges. These aspects can be summarized into the following theorem:

Theorem D.9. In iCITRIS, a model M* = ⟨θ*, ϕ*, ψ*, G*⟩ identifies a causal system S = ⟨C, E, h⟩ (Definition 3.3) if M*, under the constraint of maximizing the likelihood p_ϕ,θ,G(X^{t+1} | X^t, I^{t+1}): (1) maximizes the information content H(z^{t+1}_ψ0 | z^t) in terms of the LDDP (Jaynes, 1957; 1968), (2) minimizes the number of edges in G*, and (3) no intervention variables I^t_i, I^t_j are deterministically related, i.e., ∀j ≠ i: ¬(∃f, ∀t: I^t_i = f(I^t_j)).

E DATASETS

The following section gives a detailed overview of the datasets and the hyperparameters used in all settings. Appendix E.1 contains the description of the Voronoi benchmark, for which the experimental results are shown in Section 5.2. Appendix E.2 discusses the Instantaneous Temporal Causal3DIdent dataset, and Appendix E.3 the Causal Pinball dataset.

E.1 VORONOI BENCHMARK

The purpose of the Voronoi benchmark is to provide a flexible, synthetic dataset on which we can evaluate causal representation learning models under various settings, such as the number of variables and the graph structure (both instantaneous and temporal). For each dataset, we generate one sequence of 150k time steps, in between which single-target interventions may have been performed. At each step, we sample an intervention on each individual variable with probability 1/(K + 2), and a purely observational regime with probability 2/(K + 2). A visual example of the Voronoi benchmark is shown in Figure 10, and we describe its generation steps below.
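The intervention sampling can be sketched as follows; the function name and the one-hot array layout are our own, but the scheme is simply a regime variable with K + 2 equally likely outcomes, two of which map to the observational regime:

```python
import numpy as np

def sample_intervention_targets(K, n_steps, rng):
    """One-hot single-target interventions: each variable with probability
    1/(K+2), and a purely observational regime with probability 2/(K+2)."""
    regimes = rng.integers(0, K + 2, size=n_steps)  # K+2 equally likely outcomes
    targets = np.zeros((n_steps, K), dtype=int)
    interv = regimes >= 2                           # outcomes 0,1 -> observational
    targets[np.arange(n_steps)[interv], regimes[interv] - 2] = 1
    return targets

rng = np.random.default_rng(0)
I = sample_intervention_targets(K=6, n_steps=150_000, rng=rng)
obs_rate = (I.sum(axis=1) == 0).mean()
print(I.sum(axis=1).max())  # 1: at most one target per time step
```

For K = 6, roughly a quarter of the steps (2/8) are purely observational.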

E.1.1 NETWORK SETUP

In the Voronoi benchmark, we need a data generation mechanism for the conditional distributions p(C^{t+1}_i | pa(C^{t+1}_i)) that supports any set of parents. For this, we deploy randomly initialized neural networks, which model arbitrary, non-linear relations between any parent set and a causal variable. We visualize the network architecture in Figure 11. As a simplified setup, we use the neural networks to parameterize a Gaussian distribution. Specifically, the neural networks take as input a subset of C^t, C^{t+1} according to the given graph structure (see the next subsection for the graph generation), and output a scalar representing the mean of the conditional distribution N(C^{t+1}_i | µ(pa(C^{t+1}_i)), σ²), where the standard deviation is set to σ = 0.3. We have also experimented with having the (log) standard deviation as an additional output of the network. However, we found that this causes the true causal variables to already be the optimal solution when modeling K conditionally independent factors. Hence, both iCITRIS and the baselines were able to identify the causal variables well, making the task easier than anticipated. The interventional distribution is set to N(0, 1) for all causal variables. On the causal variables, we apply a normalizing flow consisting of six layers: Activation Normalization, Autoregressive Affine Coupling, Activation Normalization, Invertible 1x1 Convolution, Autoregressive Affine Coupling, Activation Normalization. The Activation Normalization (Kingma et al., 2018) layers are initialized once after the Batch Normalizations of the distribution neural networks have been set, and ensure that all outputs roughly have a mean of zero and a standard deviation of one. The Autoregressive Affine Coupling layers use randomly initialized neural networks, with the average standard deviation of the outputs being 0.2.
The coupling layer is volume preserving, i.e., we do not use a scaling term in the affine coupling, to prevent any issues with the image quantization. The Invertible 1x1 Convolution (Kingma et al., 2018) is initialized with a random, orthogonal matrix, entangling all causal variables across dimensions. Hence, each output of the normalizing flow is influenced by all causal variables. The BatchNorm layers (Ioffe et al., 2015) are initialized by sequentially sampling 100 batches of the causal variables, using each as the input to the next batch. This ensures that the marginal distribution p(C^{t+1}_i) has a mean close to zero and a standard deviation of one. Finally, the outputs of the normalizing flow are transformed by the function f(x) = (7/8)π · tanh(x/2). This function maps all values to the range (-(7/8)π, (7/8)π), which we can use as hues in the patches of the Voronoi diagram. The division of x by 2 is performed to reduce the number of data points in the saturation regions of the tanh. The Voronoi diagrams are generated by sampling K points on the image, which have a distance of at least 5 pixels between each other and are fixed within a dataset. In contrast to simply mapping the colors onto a grid, the Voronoi diagram is an irregular structure. Hence, the mapping from images to the K colors is non-trivial and does not transfer across datasets. Once the Voronoi diagram is created, we use matplotlib (Hunter, 2007) to visualize the structure as an RGB image.

Figure 12: Example graph structures over C_1, ..., C_4: (a) random, (b) chain, (c) full.
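The hue mapping is a single function; the sketch below only checks its stated range and monotonicity:

```python
import numpy as np

def to_hue(x):
    """f(x) = (7/8)*pi * tanh(x/2): maps flow outputs into (-(7/8)pi, (7/8)pi);
    the division by 2 keeps samples out of the saturated tails of the tanh."""
    return (7 / 8) * np.pi * np.tanh(x / 2)

x = np.linspace(-10, 10, 1001)
h = to_hue(x)
print(np.abs(h).max() < 7 / 8 * np.pi, bool(np.all(np.diff(h) > 0)))  # True True
```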

E.1.2 GRAPH GENERATIONS

For the instantaneous causal graph, we have considered three graph structures: random, chain, and full. An example of each is visualized in Figure 12. The random graph samples an edge for every possible pair of variables C_i, C_j, i ≠ j, with a chance of 0.5. Thereby, we ensure that the graph is acyclic by sampling undirected edges and directing them according to a randomly sampled ordering of the variables. This way, the average number of edges in the graph is K(K-1)/4. For small graphs of size 4, this can result in variables having no incoming or outgoing edges, also testing the model's ability on conditionally independent variables. The chain graph connects the variables in a sequence, where each variable is the parent of the next one in the sequence. This leads to each graph having K-1 edges, i.e., the sparsest yet fully connected graph. The full graph represents the densest directed acyclic graph possible. We first sample an ordering of variables, and then add an edge from each variable to all others that follow it in the sequence. Thus, it has the largest possible number of edges in a DAG, namely K(K-1)/2. Finally, the temporal graph is sampled similarly to the random graph. However, the orientations are pre-determined by the temporal ordering, and no cycles can occur. We therefore sample a directed edge between any pair of variables C^t_i, C^{t+1}_j, including i = j, with a chance of 0.25. This leads to an average number of edges of K²/4. Additionally, we ensure that every variable has at least one temporal parent, to prevent variance collapses in the neural network distributions.
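The random graph sampling can be sketched as follows; acyclicity is guaranteed by orienting undirected edges along a random ordering (the function name is ours):

```python
import numpy as np

def sample_random_dag(K, edge_prob, rng):
    """Sample undirected edges with probability edge_prob, then orient each
    edge along a random ordering of the variables -> acyclic by construction."""
    order = rng.permutation(K)
    adj = np.zeros((K, K), dtype=int)
    for i in range(K):
        for j in range(i + 1, K):
            if rng.random() < edge_prob:
                adj[order[i], order[j]] = 1   # earlier in ordering -> later
    return adj

rng = np.random.default_rng(0)
K = 4
mean_edges = np.mean([sample_random_dag(K, 0.5, rng).sum() for _ in range(5_000)])
print(abs(mean_edges - K * (K - 1) / 4) < 0.1)  # True: on average K(K-1)/4 edges
```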

E.2 INSTANTANEOUS TEMPORAL CAUSAL3DIDENT

The creation of the Instantaneous Temporal Causal3DIdent dataset closely follows the setup of Lippe et al. (2022b); von Kügelgen et al. (2021), and we show an example sequence of the dataset in Figure 13. We used the code provided by Zimmermann et al. (2021)1 to render the images via Blender (Blender Online Community, 2021), and used the following seven object shapes: Cow (Crane, 2021), Head (Rusinkiewicz et al., 2021), Dragon (Curless et al., 1996), Hare (Turk et al., 1994), Armadillo (Krishnamurthy et al., 1996), Horse (Praun et al., 2000), Teapot (Newell, 1975). As a short recap, the seven causal factors are: the object position as a multidimensional vector [x, y, z] ∈ [-2, 2]³; the object rotation with two dimensions [α, β] ∈ [0, 2π)²; the hue of the object, background, and spotlight in [0, 2π); the spotlight's rotation in [0, 2π); and the object shape (categorical with seven values). We refer to Lippe et al. (2022b, Appendix C.1) for the full detailed description of Temporal Causal3DIdent, and describe here the steps taken to adapt the dataset towards instantaneous effects. The original temporal causal graph of the Temporal Causal3DIdent dataset contains 15 edges, of which 8 are between different variables over time. Those relations form an acyclic graph, which we can directly move to instantaneous relations. Thus, the adjacency matrix of the temporal graph is an identity matrix, while the instantaneous causal graph is visualized in Figure 14. The causal mechanisms remain unchanged, except that the inputs may now be instantaneous. For instance, the spotlight rotation is adapted as follows:

Previous version: rot_s^{t+1} = f(atan2(pos_x^t, pos_y^t), rot_s^t, ε^t)
New version: rot_s^{t+1} = f(atan2(pos_x^{t+1}, pos_y^{t+1}), rot_s^t, ε^t)

[...] the previous time step value. Specifically, for circular values, we use C^{t+1}_i ∼ N(C^t_i, σ²_i) with σ_i = 2. For the position variables, we use C^{t+1}_i ∼ N_T(0.5 · C^t_i, σ_i), with N_T denoting a Gaussian truncated at [-2, 2] to prevent objects from leaving the canvas, and σ_i = 1.5. All remaining aspects of the dataset generation are identical to the Temporal Causal3DIdent dataset.

Figure 15 (caption excerpt): In image 8, the ball hits a bumper (five circle centers with light red filling), which lights up. This represents the scoring of a point, as the instantaneous increase in points in image 8 shows (the digits in the bottom right corner). Note that technically, there is no winning or losing state here, since we do not focus on learning a policy, but instead a causal representation of the components. Further, not shown here, there exists a fourth channel representing the ball's velocity.
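The truncated Gaussian for the position variables can be drawn with simple rejection sampling; a minimal sketch, not the paper's implementation:

```python
import numpy as np

def truncated_normal(mean, std, low, high, rng):
    """Rejection-sample from N(mean, std^2) truncated to [low, high]."""
    while True:
        x = rng.normal(mean, std)
        if low <= x <= high:
            return x

rng = np.random.default_rng(0)
# Position transition: C^{t+1} ~ N_T(0.5 * C^t, 1.5), truncated at [-2, 2].
c_t = 1.8
samples = [truncated_normal(0.5 * c_t, 1.5, -2.0, 2.0, rng) for _ in range(2_000)]
print(min(samples) >= -2.0 and max(samples) <= 2.0)  # True
```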

E.3 CAUSAL PINBALL

The Causal Pinball dataset is a simplified environment of the popular game Pinball, as shown in Figure 15. In Pinball, the user controls two paddles fixed at the bottom of the playing field, and tries to hit the ball such that it collides with various objects to score points. There are several versions of Pinball, but for this dataset, we limit it to the essential parts, represented by five, potentially multidimensional causal variables:

• The ball is defined by four dimensions: its position on the x- and y-axis, and its velocity along both axes.
• The left paddle y-position (paddle_left); its maximum is close to the top of the black border next to it (e.g., image 7 in Figure 15), and its minimum is close to the bottom (e.g., image 10 in Figure 15).
• The right paddle y-position (paddle_right) is similar to paddle_left, just for the right paddle.
• The bumpers represent the activation, i.e., the light, of all 5 bumpers. It is a five-dimensional continuous variable, with each dimension being between 0 (light off, e.g., image 1 in Figure 15) and 1 (light fully on, e.g., image 8 in Figure 15).
• The score is a categorical variable summarizing the number of points the player has scored. Its value ranges from 0 to a maximum of 20.

The dynamics between these causal factors resemble the standard game dynamics of Pinball, which results in the instantaneous causal graph in Figure 16. The ball can collide with the paddles, borders, and bumpers. When it collides with the borders, it is simply reflected, and we reduce its velocity by 10% (i.e., multiply it by 0.9). For collisions with the paddles, we distinguish between a collision where the paddle has been static or moving backwards, and a collision where the paddle was moving forwards. When the paddle was static, we use the same collision dynamics as for the borders, except that we reduce the ball's y-velocity by 70% to reduce oscillations around the paddle position. When the paddle was moving, we instead set the y-velocity of the ball to the y-velocity of the paddle.
Finally, when the ball collides with a bumper, it activates the bumper's light and reflects off it, similar to the borders. When a bumper's light is turned on, we increase the score by one, but include a 5% chance that the score is not increased, to introduce some stochastic elements and faulty components into the game. Next to the collisions, the ball is influenced by gravity towards the bottom, adding a constant to its y-velocity every time step, and by friction, which reduces its velocity by 2% after each time step. In terms of interventions, we sample the interventions on the five elements independently, but with probabilities that correspond more closely to the game dynamics. Specifically, we intervene on the paddles in 20% of the frames, on the ball in 10%, and on the score and bumpers in 5% each. An intervention on a paddle represents it moving forwards, from its previous position, to a randomly sampled position between the middle and maximum paddle position. Its velocity is set to the difference between the previous and new position. Since these interventions are usually elements of standard Pinball game play, we sample them rather often, with 20%. An intervention on the ball represents stopping it at its current position and giving it a slightly random velocity towards the bottom. In real life, this would correspond to a player interfering with the ball by stopping it with their hand. To prevent instantaneous effects from the paddles, we move the ball slightly up if it is in reach of the paddles. An intervention on the bumpers randomly activates each bumper with a 25% chance, while leaving the others untouched and maintaining their original dynamics. Finally, an intervention on the score resets it to a random value between 0 and 4. To render the images, we use matplotlib (Hunter, 2007) and a resolution of 64 × 64 pixels. We generate a single sequence of 150k images.
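A rough sketch of the ball dynamics described above; only the 2% friction and the 10% border damping come from the text, while the gravity constant, field size, and function name are assumed values for illustration:

```python
import numpy as np

GRAVITY = 0.05    # constant added to the y-velocity every step (assumed value)
FRICTION = 0.98   # fraction of velocity kept after friction (2% reduction)
DAMPING = 0.9     # fraction of velocity kept after a border collision

def step_ball(pos, vel, width=32.0, height=32.0):
    """One simplified dynamics step: gravity, friction, border reflection."""
    pos, vel = np.asarray(pos, float).copy(), np.asarray(vel, float).copy()
    vel[1] += GRAVITY                     # gravity towards the bottom
    vel *= FRICTION                       # 2% friction per time step
    pos += vel
    for d, limit in enumerate((width, height)):
        if pos[d] < 0.0 or pos[d] > limit:
            pos[d] = np.clip(pos[d], 0.0, limit)
            vel[d] = -DAMPING * vel[d]    # reflect, reduce velocity by 10%
    return pos, vel

pos, vel = step_ball([1.0, 5.0], [-2.0, 0.0])
print(pos[0] == 0.0, vel[0] > 0.0)  # True True: reflected off the left border
```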
Algorithm 1 Pseudocode of the training algorithm for the prior and graph learning in iCITRIS with NOTEARS as graph learning method. For efficiency, all for-loops are processed in parallel in the code.
Require: batch of observation samples and intervention targets: B = {x^t, x^{t+1}, I^{t+1}}_{n=1}^N
1: for each batch element (x^t, x^{t+1}, I^{t+1}) do
2:   Encode observations into latent space: z^t = g_θ(x^t), z^{t+1} = g_θ(x^{t+1})
3:   Differentiably sample one graph G: G_ij ∼ GumbelSoftmax(1 - σ(γ_ij), σ(γ_ij))
4:   Sample latent-to-causal assignments from ψ for each batch element
5:   for each causal variable C_i do
6:     Determine parent mask from G: S ∈ {0, 1}^M, S_j = G_{ψ(j),i}
7:     Calculate nll_i = -log p_ϕ(z^{t+1}_ψi | z^t, z^{t+1} ⊙ S, I^{t+1}_i)
8:   end for
9:   Backpropagation loss L_n = Σ_{i=1}^K nll_i
10: end for
11: Acyclicity regularizer: L_cycle = tr(exp(σ(γ))) - K
12: Sparsity regularizer: L_sparse = (1/K²) Σ_{i=1}^K Σ_{j=1}^K σ(γ_ij)
13: Update parameters ϕ, ψ, γ with ∇_{ϕ,ψ,γ} [λ_cycle · L_cycle + λ_sparse · L_sparse + (1/N) Σ_{n=1}^N L_n]

F EXPERIMENTAL DETAILS

In this section, we give further implementation details of iCITRIS and the hyperparameters that were used for the experiments in Section 5.

F.1 ICITRIS -MODEL DETAILS

Similar to CITRIS, iCITRIS can be implemented either as a VAE, or as a normalizing flow trained on the representation of a pretrained autoencoder. The core elements to implement in iCITRIS are:

• The map g_θ from observations x^t to latents z^t and back (iCITRIS-VAE: convolutional encoder-decoder of the VAE | iCITRIS-NF: an autoregressive normalizing flow)
• The assignment function ψ of latents to causal variables (a matrix of R^{M×(K+1)} from which we sample via the Gumbel softmax)
• The prior distributions p_ϕ (MLPs for conditional Gaussians)
• The continuous-optimization causal discovery method for learning the instantaneous causal graph (ENCO or NOTEARS, see below)

The first three are the same as in CITRIS, with the last being novel in iCITRIS. Thus, we discuss the implementation details of this graph learning below, as well as the mutual information estimator, which is only necessary for perfect interventions and extra optimization stability. For the two graph learning methods, we additionally discuss the specific setup used to learn the prior distributions p_ϕ.

Graph Learning - NOTEARS The full training algorithm of iCITRIS with the NOTEARS graph parameterization is shown in Algorithm 1. The adjacency matrix is parameterized by γ ∈ R^{(K+1)×(K+1)}, where σ(γ_ij), with σ being the sigmoid function, represents the probability of having the edge z_ψi → z_ψj in the instantaneous graph. To prevent self-loops, we set γ_ii = -∞, i = 0, ..., K, and γ_i0 = -∞, i = 1, ..., K, to guarantee an empty instantaneous parent set for z_ψ0. At each training iteration, we sample an adjacency matrix per batch element using the Gumbel softmax trick (Jang et al., 2017). These matrices are used to mask out the inputs to the prior, such that we obtain gradients by optimizing the likelihood of the prior. Further, NOTEARS requires two regularizers. First, the acyclicity regularizer takes the matrix exponential of the edge probabilities, σ(γ).
The trace of this matrix exponential has a minimum of K, which is only achieved if the matrix does not contain any cycles. All operations in this regularizer are differentiable, and we weigh this regularizer in the loss by λ_cycle. This weighting factor follows a schedule over training, which starts with a value of exp(-6) ≈ 2.5e-3 and reaches a maximum of exp(4) ≈ 54.6. In our experiments, this maximum factor ensured the graph to be approximately acyclic. The second regularizer is a sparsity regularizer, which removes redundant edges and is implemented as an L1 regularizer on the edge probabilities.
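A numpy sketch of this acyclicity regularizer (we subtract the matrix dimension so that the minimum of the penalty is zero; this is an illustration, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def matrix_exp(A, terms=30):
    """Matrix exponential via its power series; fine for small matrices."""
    out, P = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        P = P @ A / k
        out = out + P
    return out

def acyclicity_penalty(gamma):
    """L_cycle = tr(exp(sigma(gamma))) - d, where d is the matrix dimension;
    the penalty is zero iff no cycle has non-zero edge probabilities."""
    return np.trace(matrix_exp(sigmoid(gamma))) - len(gamma)

# gamma = -30 stands in for the -inf used to mask out edges:
acyclic = np.array([[-30.0, 10.0], [-30.0, -30.0]])  # only edge 0 -> 1
cyclic = np.array([[-30.0, 10.0], [10.0, -30.0]])    # 0 <-> 1 two-cycle
print(acyclicity_penalty(acyclic) < 1e-6, acyclicity_penalty(cyclic) > 0.5)
```

The two-cycle keeps a strictly positive penalty, which the λ_cycle schedule then pushes towards zero over training.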


Figure 17: A visualization of iCITRIS as a VAE framework using ENCO in its prior. Similar to CITRIS, iCITRIS uses an encoder-decoder structure to map images x^{t+1} to latents z^{t+1} and back. The assignment function ψ splits the latent vector z^{t+1} into K parts (here K = 3), one per causal variable. Between these, we learn a causal graph with ENCO, and condition the variables additionally on the intervention targets I^{t+1} according to ψ, and on the previous time step z^t.

Graph Learning - ENCO The full training algorithm of iCITRIS with the ENCO graph parameterization is shown in Algorithm 2. The adjacency matrix is parameterized by two sets of parameters, with γ ∈ R^{(K+1)×(K+1)} representing the edge existence parameters, and θ ∈ R^{(K+1)×(K+1)} the orientation parameters, with θ_ij = -θ_ji. The probability of an edge z_ψi → z_ψj in the instantaneous graph is determined by σ(γ_ij) · σ(θ_ij). Similar to NOTEARS, we prevent self-loops by setting γ_ii = -∞, i = 0, ..., K, and fix the orientations of the edges of z_ψ0 by setting θ_i0 = -θ_0i = -∞, i = 1, ..., K. In contrast to NOTEARS, this parameterization leads to initial edge probabilities of 0.25.
We found it beneficial to initialize the edge probabilities closer to 0.5, which we implement by initializing γ_ij = 4, i ≠ j, i, j ∈ 1..K (σ(4) ≈ 0.98). At each training iteration, we sample L graphs from ENCO. For all experiments, we found L = 8 to be sufficient. For each of these graphs, we evaluate the negative log-likelihood of all variables. Note that, in contrast to NOTEARS, this does not need to be differentiable with respect to γ and θ. Once all graphs are evaluated, we can determine the average negative log-likelihood of z_ψj under graphs with the edge z_ψi → z_ψj, versus graphs where this edge is missing. We use this to determine the gradients of γ_ij and θ_ij if z_ψj has not been intervened upon. For the gradients of θ_ij, we further mask out gradients for batch samples in which z_ψi has not been intervened upon. With these gradients, we can update the graph parameters, while the distribution parameters are updated based on the differentiable negative log-likelihood. Note that the sparsity regularizer, λ_sparse, is integrated into the update of the γ parameters.

Prior networks Both graph learning algorithms use a prior network of the form p_ϕ(z^{t+1}_ψi | z^t, z^{t+1}_{ψ_pa_i}, I^{t+1}_i). To implement this efficiently in a neural network setting, we consider for each latent z_m, m ∈ 1..M, a 2-layer neural network (hidden size 32 in all experiments) that takes as input z^t, z^{t+1}, I^{t+1}, and a mask on z^{t+1} and I^{t+1}. Therefore, its input size is M + M + K + M + K = 3M + 2K. The mask on I^{t+1} depends on which causal variable the latent z_m has been assigned to, i.e., z^{t+1}_ψi should only depend on I^{t+1}_i. The mask on z^{t+1} depends on the graph that was sampled, in combination with the causal variable assignment, i.e., only z^{t+1}_{ψ_pa_i} is left unmasked. Further, we can use an autoregressive prior over the potentially multiple dimensions of z^{t+1}_ψi by leaving previous latents unmasked that have been assigned to the causal variable C_i.
We use this autoregressive variant for the Instantaneous Temporal Causal3DIdent and Causal Pinball datasets, since the multiple dimensions of those causal factors may not be independent.
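The ENCO edge parameterization described above can be illustrated as follows; note how the default initialization γ = θ = 0 yields edge probabilities of 0.25, while γ_ij = 4 moves the total probability of an edge existing in either orientation to σ(4) ≈ 0.98:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_probs(gamma, theta):
    """p(z_i -> z_j) = sigma(gamma_ij) * sigma(theta_ij), with theta_ji = -theta_ij.
    (In iCITRIS, the diagonal and the z_psi0 entries are additionally masked.)"""
    assert np.allclose(theta, -theta.T), "orientation parameters are antisymmetric"
    return sigmoid(gamma) * sigmoid(theta)

K = 3
theta = np.zeros((K, K))                          # undecided orientations
p_default = edge_probs(np.zeros((K, K)), theta)   # gamma = 0
p_init = edge_probs(np.full((K, K), 4.0), theta)  # iCITRIS init: sigma(4) ~ 0.98

print(p_default[0, 1])                        # 0.25 initial edge probability
print(round(p_init[0, 1] + p_init[1, 0], 2))  # 0.98: edge in some orientation
```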

Mutual information estimator

The full training algorithm of the mutual information estimator is shown in Algorithm 3. The MI estimator is a 2-layer network that takes as input the latent parents of a causal variable and its current value, and has a single output value. This value indicates whether the current causal variable and its parents match, i.e., whether z^{t+1}_ψi is the value of C_i at time step t+1 based on observing the parents z^t and z^{t+1}_{ψ_pa_i}, or not. We train this network with a binary classification objective. Since the model does not have the precise time step t or τ as input, it has to deduce from the values of the causal variables whether they match or not. Under interventions, we know that for the true causal variables, the optimal performance of this binary classifier is 0.5, because C^{t+1}_i is independent of all its parents under perfect interventions. Thus, the gradients for the latents move the classifier output closer to 0.5, which is equivalent to increasing the misclassification rate of the MI estimator. During training, we need to sample instantaneous graphs G from our graph parameterization. Since this graph is close to random especially in the beginning, and the true causal variables still depend, for instance, on their children under interventions, training the MI estimator on all parents from the start can lead to unstable behavior. Instead, we initially train the MI estimator with an empty instantaneous causal graph and try to make z^{t+1}_ψi independent of z^t, i.e., its temporal parents. Over the course of training, we introduce the instantaneous parents, similar to the graph learning schedule, such that at the end of training, the MI estimator is fully trained on both temporal and instantaneous parents.

Algorithm 2 Pseudocode of the training algorithm for the prior and graph learning in iCITRIS with ENCO as graph learning method. For efficiency, all for-loops are processed in parallel in the code.
Require: batch of observation samples and intervention targets: B = {x^t, x^{t+1}, I^{t+1}}_{n=1}^N
1: for each batch element (x^t, x^{t+1}, I^{t+1}) do
2:   Encode observations into latent space: z^t = g_θ(x^t), z^{t+1} = g_θ(x^{t+1})
3:   Sample L graphs G^1, ..., G^L with G^l_ij ∼ Bernoulli(σ(θ_ij) σ(γ_ij))
4:   Sample latent-to-causal assignments from ψ for each batch element
5:   for each graph G^l do
6:     for each causal variable C_i do
7:       Determine parent sets for graph G^l: z^{t+1}_{ψ_pa_i} = {z^{t+1}_j | j ∈ 1..M, ψ(j) ∈ pa_{G^l}(i)}
8:       Calculate nll^l_i = -log p_ϕ(z^{t+1}_ψi | z^t, z^{t+1}_{ψ_pa_i}, I^{t+1}_i)
9:     end for
10:   end for
11:   Backpropagation loss L_n = (1/L) Σ_{i=1}^K Σ_{l=1}^L nll^l_i
12:   Average nll for C_i → C_j versus C_i ̸→ C_j: pos_nll^n_ij = (Σ_l G^l_ij nll^l_j) / (Σ_l G^l_ij), neg_nll^n_ij = (Σ_l (1 - G^l_ij) nll^l_j) / (L - Σ_l G^l_ij)
13: end for
14: Theta gradients: ∇(θ_ij) = σ(γ_ij) σ'(θ_ij) (1/N) Σ_n I^n_i (1 - I^n_j) (pos_nll^n_ij - neg_nll^n_ij)
15: Gamma gradients: ∇(γ_ij) = σ(θ_ij) σ'(γ_ij) (1/N) Σ_n (1 - I^n_j) (pos_nll^n_ij - neg_nll^n_ij + λ_sparse)

F.2 HYPERPARAMETERS

We summarize all hyperparameters in Table 4. Additionally, we discuss the main hyperparameter choices for all models here.

Base VAE architecture. For all VAE-based methods, we applied the same VAE to ensure a fair comparison between methods. In particular, we used a VAE with a normalizing flow prior (Rezende et al., 2015), inspired by inverse autoregressive flows (Kingma et al., 2016). The encoder outputs the parameters of M independent Gaussian distributions. A sample from these Gaussians is used as input to the decoder to reconstruct the original image, but also as input to a four-layer autoregressive normalizing flow. This flow consists of a sequence of Activation Normalization (Kingma et al., 2018), Invertible 1×1 Convolutions (Kingma et al., 2018), and autoregressive affine coupling layers. The outputs of the flow are used as input to a prior, which is conditioned on the latents of the previous time step and the intervention targets. For iCITRIS, this prior follows the structure of Equation (2), including causal discovery. For CITRIS, this prior is similar to Equation (2), except that no instantaneous parents are modeled. For the iVAE, the prior is a 3-layer MLP that outputs M independent Gaussian distributions. Finally, for the iVAE-AR, the prior is a 2-layer autoregressive neural network predicting M Gaussian distributions in sequence. The reconstruction loss is based on the mean-squared error (MSE) objective, which provided much better results than learning a flexible distribution over the output images. The specific architecture of the encoder and decoder depends on the dataset, where we used simpler models where possible to reduce computational cost without losing significant performance. For the Voronoi benchmark, we use a 5-layer CNN. For the Instantaneous Temporal Causal3DIdent dataset and the Causal Pinball dataset, we used a 10-layer CNN for the encoder and a 5-layer ResNet (He et al., 2016) as decoder.

Autoencoder + normalizing flow architecture. For iCITRIS and CITRIS, we use the variation of training a normalizing flow on a pretrained autoencoder for the Instantaneous Temporal Causal3DIdent and Causal Pinball datasets. The autoencoder uses the same encoder and decoder architecture as the VAE, except that we increase the decoder size, since the autoencoder can be trained much faster than the VAE (it does not require any temporal dimension) and, in contrast to the VAE, the larger decoder led to improvements in the reconstructions for the two datasets. The autoencoder is trained to reconstruct the input images, where we add Gaussian noise with a small standard deviation (0.05) to the latents to simulate a distribution. Additionally, we apply a small L2 regularizer to the latents to prevent the autoencoder from counteracting the noise by artificially scaling up the standard deviation of the latents. We use a weight of 1e-5 on this regularizer for Causal3DIdent, and 1e-6 for Causal Pinball, since its reconstructions obtain much lower losses. The normalizing flow applied on top follows the same architecture as in the VAE.

Optimizer. For all models, we use the Adam optimizer (Kingma et al., 2015) with a learning rate of 1e-3. Additionally, we warm up the learning rate over the first 100 steps. Afterwards, we follow a cosine annealing learning rate schedule that decreases the learning rate to 5e-5 over the course of training.

Frameworks. All models have been implemented and trained using PyTorch v1.10 (Paszke et al., 2019) and PyTorch Lightning v1.6.0 (Falcon et al., 2019).

Algorithm 3: Pseudocode of the training algorithm for the mutual information estimator in iCITRIS. For efficiency, all for-loops are processed in parallel in the code.

Require: batch of observation samples and intervention targets: B = {x^t, x^{t+1}, I^{t+1}}_{n=1}^N
1: Encode all observations into latent space: z^t = g_θ(x^t), z^{t+1} = g_θ(x^{t+1})
2: Sample an instantaneous graph G from the graph parameterization
3: Sample latent-to-causal assignments from ψ
4: for each causal variable C_i do
5:   Determine parent sets for graph G: z_{ψ_{pa_i}}^{t+1} = {z_j^{t+1} | j ∈ 1..M, ψ(j) ∈ pa_G(i)}
6:   Calculate logits of positive pairs: e_pos = NN_MI(z_{ψi}^{t+1}, z^t, z_{ψ_{pa_i}}^{t+1})
7:   For each batch element, sample a different, random time step in the batch, τ
8:   Calculate logits of negative pairs: e_neg = NN_MI(z_{ψi}^{τ+1}, z^t, z_{ψ_{pa_i}}^{t+1})
9:   Calculate loss for the MI estimator: L_i^NNMI = -e_pos + log[exp(e_pos) + exp(e_neg)]
10:  Calculate loss for the latents: L_i^zMI = -e_neg + log[exp(e_pos) + exp(e_neg)]
11: end for
12: Update parameters of NN_MI according to the average loss L_i^NNMI
13: Backpropagate gradients of the latents according to the average loss L_i^zMI
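The two losses of the mutual information estimator in Algorithm 3 can be written compactly. The sketch below uses a plain NumPy stand-in for NN_MI with fixed random weights, so all names and shapes are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def nn_mi(params, value, parents):
    """Stand-in 2-layer MLP scoring whether a latent value matches its parents."""
    W1, b1, W2, b2 = params
    h = np.tanh(np.concatenate([value, parents], axis=-1) @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)

def mi_losses(params, z_value, z_parents, rng):
    # Positive pairs: value and parents come from the same time step.
    e_pos = nn_mi(params, z_value, z_parents)
    # Negative pairs: the value is taken from a random other time step tau.
    e_neg = nn_mi(params, z_value[rng.permutation(len(z_value))], z_parents)
    lse = np.logaddexp(e_pos, e_neg)
    loss_estimator = np.mean(-e_pos + lse)  # trains NN_MI to tell pairs apart
    loss_latents = np.mean(-e_neg + lse)    # pushes the latents toward 50% accuracy
    return loss_estimator, loss_latents
```

In practice, the first loss should only update the estimator's weights and the second only the latents (e.g., via stop-gradients), so that the two play the adversarial roles described above.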

F.3 EVALUATION METRICS

For the details on the correlation matrix evaluation, we refer to Lippe et al. (2022b, Appendix C.3.1). The causal graph evaluation is performed for each model in the same way. For each model, we use the checkpoint with the best training loss and encode all observations into the latent space. Next, we need to separate the latent space into the causal variables. For iCITRIS and CITRIS, we use the learned assignment function ψ to assign latent variables to causal variables. Since the iVAE models do not learn such a latent-to-causal assignment, we instead assign each latent variable to the causal factor it has the highest correlation to. This requires using the ground truth values of the causal variables, and hence gives the iVAE an advantage over iCITRIS and CITRIS. With this separation, we apply ENCO (Lippe et al., 2022a) to learn the temporal and instantaneous graph. Since iCITRIS already learns an instantaneous graph, we reuse the learned orientations of the model and only relearn the edge existence parameters, γ, for potential pruning. In general, we found that the graphs predicted by iCITRIS have a few redundant edges between ancestors and descendants, which occur due to correlations in the early training iterations and can easily be removed in this post-processing step.

As an additional metric that jointly evaluates the disentanglement of the causal variables and the learned causal graph, we use the causal graph and distributions learned by ENCO to sample new data points under novel interventional settings. For each data point in the test dataset, we use the trained model to sample the latents of the next time step and map them back to the true causal variable space. This mapping is done by a small neural network trained on the latents of the training dataset. To evaluate how well these samples match the interventional distributions of the true causal model, we train a small discriminator network that tries to distinguish between the true data points in the test set and the newly generated ones from our model. Only a model that has disentangled the causal variables and learned the correct causal graph can perform well on this metric. We show the results for this metric on the Voronoi benchmark and the Instantaneous Temporal Causal3DIdent dataset in Appendix G.

The results on this dataset are shown in Table 13. Besides identifying the causal variables well, iCITRIS-ENCO identifies the instantaneous causal graph with minor errors. Interestingly, CITRIS obtains a good correlation score here as well. This is likely due to the instantaneous effects being very sparse, and perfect interventions giving a very strong preference towards independent variables in this case. Yet, there is still a gap between iCITRIS-ENCO and CITRIS in the instantaneous SHD, showing the benefit of learning the instantaneous graph jointly with the causal variables.
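A minimal sketch of the discriminator-based metric described above, with a logistic-regression classifier trained by gradient descent standing in for the small discriminator network; accuracy near 0.5 indicates indistinguishable distributions. All names and hyperparameters here are our own illustrative choices.

```python
import numpy as np

def discriminator_accuracy(real, fake, steps=500, lr=0.1, seed=0):
    """Train a logistic-regression discriminator to tell true interventional
    samples from model-generated ones, and return its held-out accuracy.
    0.5 is the best achievable score (distributions fully overlap)."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([real, fake], axis=0)
    y = np.concatenate([np.zeros(len(real)), np.ones(len(fake))])
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    n_tr = int(0.7 * len(X))
    X_tr, y_tr, X_te, y_te = X[:n_tr], y[:n_tr], X[n_tr:], y[n_tr:]
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        # Full-batch gradient step on the logistic loss.
        p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))
        grad = p - y_tr
        w -= lr * (X_tr.T @ grad) / len(X_tr)
        b -= lr * grad.mean()
    return np.mean(((X_te @ w + b) > 0) == (y_te == 1))
```

The paper uses a small neural-network discriminator instead; the linear classifier above only illustrates the principle of the metric.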



https://github.com/brendel-group/cl-ica



Figure 1: An example causal graph in iTRIS. A latent causal variable C_i^{t+1} can have as potential parents a subset of the causal variables at the previous time step C^t = (C_1^t, ..., C_K^t), instantaneous parents C_j^{t+1}, i ≠ j, and its intervention variable I_i^{t+1}. All causal variables C^{t+1} and the observation noise E^{t+1} cause the observation X^{t+1} = h(C^{t+1}, E^{t+1}). R^{t+1} is a latent confounder allowing for dependencies between intervention variables.

Figure 3: Example of Causal Pinball.

i.e., any additional parent set that does not introduce cycles. We do not put any constraints on the distribution p nor on the provided interventions, except that we do not know whether, under I_i^{t+1} = 1, C_i^{t+1} becomes independent of C_j^{t+1} or not. This implies that one must consider the most general form of interventions for C_i, i.e., modeling the distribution p_i(C_i^{t+1} | pa(C_i^{t+1})) under interventions with possibly unknown independences. To keep this result general, we consider an arbitrary observation function h.

Figure 4: Example distribution showcasing the necessity of partially-perfect interventions for disentangling causal variables with instantaneous effects. Suppose we are given two-dimensional observations X^t, for which the observational and interventional distributions are plotted in (a)-(c). The central plot of each subfigure shows a 2D histogram, and the subplots above and to the right show the 1D marginal histograms. For simplicity, we keep the previous time step, X^{t-1}, constant here. From the interventional distributions, one might suggest the latent causal graph C_1 → C_2, since under I_1^t = 1 both observed distributions change, while I_2^t = 1 keeps X_2 unchanged. However, the data has actually been generated from two independent causal variables, which have been entangled via X^t = [C_1^t, C_1^t + C_2^t]. We cannot distinguish between these two latent models from interventions that do not reliably break instantaneous causal effects, showing the need for partially-perfect interventions.

Figure 5: Example instantaneous causal graph between 3 causal variables C 1 , C 2 , C 3 . Without temporal dependencies, we could encode information of C 1 dependent on C 3 without needing an edge in the distribution.

transfer to the observed dataset, and no additional relations are introduced by sample biases. In practice, a large sample size is likely to give an accurate enough description of the true distributions.

D.4 THEOREM 3.4 - PROOF STEP 1: THE TRUE MODEL IS A GLOBAL OPTIMUM OF THE LIKELIHOOD OBJECTIVE

Figure 6: An example causal graph in iTRIS. A latent causal factor C_i^{t+1} can have as potential parents the causal factors at the previous time step C^t = (C_1^t, ..., C_K^t), instantaneous parents C_j^{t+1}, i ≠ j, and its intervention variable I_i^{t+1}. All causal variables C^{t+1} and the noise E^{t+1} cause the observation X^{t+1}. R^{t+1} is a potential latent confounder between the intervention targets.

Figure 7: The minimal causal variable in terms of a causal graph under iTRIS. (a) In the original causal graph, C_i^{t+1} has as potential parents the causal variables of the previous time step C^t (possibly a subset), its instantaneous parents pa^{t+1}(C_i^{t+1}), and the intervention target I_i^{t+1}. (b) The minimal causal variable splits C_i^{t+1} into an invariable part s_i^inv(C_i^{t+1}) and a variable part s_i^var(C_i^{t+1}).

Figure 8: Example instantaneous causal graph between 3 causal variables C 1 , C 2 , C 3 , and the augmented graphs under different single-target interventions that remove instantaneous parent dependencies. The augmented graphs have the edges to the intervened variables removed. For readability, the intervened variables are colored in red in the graphs.

Figure 9: Identifiability of a causal relation between two variables C_1, C_2 under different interventional settings. (a) The causal relation to consider; the discussion is identical in case of the reverse orientation by switching the variable names C_1 and C_2. (b-d) The tables describe the minimal sets of experiments, i.e., unique combinations of I_1, I_2 in the dataset, that guarantee the intervention targets to be unique, i.e., not ∀t, I_1^t = I_2^t. Under each of these sets of experiments, we show that the maximum likelihood solution of p(C_1, C_2 | I_1, I_2) uniquely identifies the causal orientation.

Figure 10: Example sequences of the Voronoi benchmark for the different graph sizes. Each 32×32 image is partitioned into K patches. The values of the K true causal variables are transformed by a two-layer normalizing flow, which results in the hues of the K patches in [-7π/8, 7π/8]. The hues are finally mapped into RGB space, resulting in the images above.
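The hue-to-RGB mapping described in the caption can be sketched as follows; the fixed lightness and saturation values are our own illustrative choice, not the benchmark's exact rendering code.

```python
import colorsys
import numpy as np

def hues_to_rgb(latent_hues):
    """Map hue angles in [-7/8*pi, 7/8*pi] to RGB patch colors.
    colorsys expects the hue as a fraction of a full turn in [0, 1)."""
    rgb = []
    for hue in np.asarray(latent_hues).ravel():
        h = (hue / (2 * np.pi)) % 1.0  # shift angle into [0, 1)
        rgb.append(colorsys.hls_to_rgb(h, 0.5, 1.0))
    return np.array(rgb)
```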

Figure 12: Example instantaneous causal graphs with four variables for the three graph structures. The causal ordering of the causal variables is randomly sampled for each graph to prevent any structural biases.

Figure 13: Example sequence from the training set of the Instantaneous Temporal Causal3DIdent dataset (from left to right, top to bottom). Each image is of size 64 × 64 pixels. One can see the instantaneous effects of the background influencing the object color, for instance, or the object color again influencing the rotation of the object.

Figure 14: The instantaneous causal graph in the Instantaneous Temporal Causal3DIdent dataset.The graph contains several common sub-structures, such as a chain (rot_o→pos_o→rot_s), a fork (hue_o,hue_b→rot_o), and confounders (hue_b→hue_s,hue_o). The most difficult edges to recover include rot_o→pos_o since the object orientation has a complex, non-linear relation to the observation space which is difficult to model and prone to noise. Further, the edge hue_b,hue_s→hue_o only holds for two object shapes (Hare and Dragon), for which the background and spotlight hue have an influence on the object color. For the other five object shapes, the object color is independent of the other two parents.

Figure 15: An example sequence of the Pinball dataset, from left to right, top to bottom. The paddles, i.e., the two gray rectangles in the bottom center, are accelerated forwards under interventions such that they make a large jump within an image. For instance, in image 5, the right paddle has been intervened upon and hits the ball (gray circle), which is accelerated immediately, showcasing the instantaneous effect between the two. When no interventions on the paddles are given, they slowly move backwards. In image 8, the ball hits a bumper (5 circle centers with light red filling), which lights up. This represents the scoring of a point, as the instantaneous increase in points in image 8 shows (the digits in the bottom right corner). Note that, technically, there is no winning or losing state here, since we do not focus on learning a policy, but instead on a causal representation of the components. Further, not shown here, there exists a fourth channel representing the ball's velocity.

Figure 16: The instantaneous causal graph in the Causal Pinball dataset. An intervention on the paddles can have an immediate effect on the ball by changing its position and velocity. A change in the ball's position in turn influences whether the bumpers' lights are activated. Finally, when the bumpers are activated, the score increases in the same time step.




Figure 18: Learned instantaneous graphs in the Instantaneous Temporal Causal3DIdent dataset for all five models for a single seed. Red arrows indicate false positive edges, and dashed red arrows false negatives. (a) The ground truth graph of the dataset. (b) For one seed, iCITRIS-ENCO achieves a perfect recovery of the graph; for the other two seeds, it misses one edge to hue_o, since hue_b only affects it for certain object shapes, and predicts an additional edge from the object shape to the rotation due to the complexity of the problem. (c) iCITRIS-NOTEARS has more false positive and false negative edges than iCITRIS-ENCO; however, all orientations are correct. (d) CITRIS predicts a sparser graph than the true one in general, but in contrast to iCITRIS, it also obtains wrong orientations several times (e.g., between pos_o and rot_o). (e) The iVAE obtains graphs very different from the ground truth, with many incorrect edges. (f) Due to the autoregressive prior in iVAE-AR, we observe a significant number of false positive edges, with occasional incorrect orientations as well.



Results on the Causal Pinball dataset over three seeds (see Table 12 for standard deviations).



In particular, Lachapelle et al. (2022a;b) and Yao et al. (2022a;b) discuss the identifiability of causal variables from temporal sequences. As forms of interventions, Lachapelle et al. (2022a;b) consider external actions, while Yao et al. (2022a;b) use non-stationary noise. Yet, in all of these ICA-based setups, the causal variables are required to be conditionally independent. Alternatively, Yang et al. (2021) learn causal variables from labeled images in a supervised manner. Given a known causal structure, von Kügelgen et al. (2021) demonstrate that common contrastive learning methods can block-identify causal variables that remain unchanged under augmentations. Locatello et al. (2020a) identify independent latent causal factors from pairs of observations that only differ in a subset of causal factors. Brehmer et al. (2022) extend this setup to causally related variables with access to single-target interventions. Similarly, Ahuja et al. (2022) extend the setup of Locatello et al. (2020a) to variables with interdependencies, relying on interventions that only affect their target. All these methods require pairs of counterfactual observations, where only a subset of variables is changed by the intervention while the rest are frozen, i.e., they keep the same values before and after an intervention. As discussed by Pearl (



Experimental results of the large-scale study on the Voronoi dataset for predicting the temporal graph, including the recall and precision to highlight false negative and positive predictions.

Experimental results on the Causal Pinball dataset over three seeds.

ACKNOWLEDGMENTS

We thank Johann Brehmer and Pim de Haan for valuable discussions throughout the project. We also thank SURFsara for the support in using the Lisa Compute Cluster. This work is financially supported by Qualcomm Technologies Inc., the University of Amsterdam and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.

AUTHOR CONTRIBUTIONS

P. Lippe conceived the idea, derived the theoretical results, implemented the models and datasets, and wrote the paper. S. Magliacane, S. Löwe, Y. M. Asano, T. Cohen, and E. Gavves advised during the project and helped in writing the paper.

Table 4: Summary of the hyperparameters for all models evaluated on the Voronoi benchmark, the Instantaneous Temporal Causal3DIdent dataset, and the Causal Pinball dataset. For all methods, we performed a hyperparameter search over the individual, most crucial hyperparameters (e.g., the KLD factor in the iVAE). The smaller networks in the latter two datasets for the iVAE architectures were chosen because they require training the full encoder, decoder, and NF at the same time, and larger networks did not show any noticeable improvements. The graph learning warmup in iCITRIS is equal for all datasets, although deviations to e.g. 5k, 15k, or 20k steps often work equally well. Further, the results were robust to the weighting parameters of the target classifier and mutual information estimator in iCITRIS, such that, for instance, equally good results were achieved with smaller (5) or higher (20) weights on the Voronoi benchmark. For the graph sparsity regularizer, we used the same value for all graph structures of the same size.

G ADDITIONAL EXPERIMENTAL RESULTS AND ABLATION STUDIES

In this section, we list the detailed results of the experiments in Section 5, including the standard deviations over multiple seeds. We further provide results on the metric for predicting intervention outcomes, as described in Appendix F.3. Moreover, we present ablation studies on the Voronoi benchmark to further investigate the limitations of iCITRIS. Finally, we include a visualization of the predicted graph by all models on the Instantaneous Temporal Causal3DIdent dataset and the Causal Pinball environment.

G.1 VORONOI BENCHMARK

The full experimental results for the Voronoi benchmark can be found in Table 5. Compared to the results in Figure 2, we also show the results of the discriminator that is trained on newly generated samples from the models. It is apparent that a crucial factor for simulating the true interventional distributions is to have low entanglement across factors (R² sep). Both the iVAE and especially the iVAE-AR have a strong entanglement between factors and show a significant gap between the true distribution and their modeled ones. For instance, on the random graphs of size 9, 90% of the samples from iVAE-AR can be correctly classified, indicating that the distributions do not overlap much. In comparison, iCITRIS achieves close-to-optimal scores on the small graphs with only 55% accuracy. Note that 50% is random performance, i.e., the optimum that could be achieved. Still, with larger graphs, the performance of iCITRIS also degrades, although it still outperforms all baselines.

Furthermore, to show the specific failure types of the different models in graph prediction, we additionally list the recall and precision of the graph prediction (instantaneous: Table 6, temporal: Table 7). A high recall (max 1.00) reflects that a model is able to recover all edges, while a high precision (max 1.00) shows that the model does not overpredict false positive edges. One key characteristic of all models is that they tend to have a higher recall than precision for the temporal graph. In other words, many mistakes are due to predicting too many edges, which easily occurs when causal variables are entangled. However, on the instantaneous graphs, we clearly see that the baselines, CITRIS and iVAE, predict a sparse graph by having a low recall. This also underlines that the false positive edges in the temporal graph cannot simply be removed by increasing the sparsity regularizer in the causal discovery method, since otherwise even more edges would be lost in the instantaneous graph. iVAE-AR, on the other hand, has a low precision and recall on the instantaneous graphs, showcasing that it predicts a very different graph with anticausal edges. Meanwhile, only iCITRIS-ENCO obtains a high recall and precision across the different graph structures and sizes.

Next, we look at ablation studies that investigate the applications and limitations of iCITRIS.
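The edge-level recall, precision, and structural Hamming distance (SHD) reported in the tables can be computed from binary adjacency matrices as in the sketch below; the reversed-edge handling reflects one common SHD convention and may differ in detail from the exact evaluation code.

```python
import numpy as np

def graph_metrics(pred, true):
    """Edge-level recall/precision and SHD for binary adjacency matrices,
    where A[i, j] = 1 denotes the edge i -> j."""
    pred = np.asarray(pred, dtype=bool)
    true = np.asarray(true, dtype=bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    fn = int(np.sum(~pred & true))
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    # A predicted edge i -> j whose ground truth is j -> i shows up as both a
    # false positive and a false negative; count it only once for the SHD.
    reversed_edges = int(np.sum(pred & ~true & true.T & ~pred.T))
    shd = fp + fn - reversed_edges
    return recall, precision, shd
```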

G.1.1 ABLATION 1: NOISY INTERVENTION TARGETS

In the first ablation study, we focus on the dependency of iCITRIS on accurate intervention targets. In practice, performing perfect interventions is a difficult task and is prone to noise. While we can easily observe whether we pushed a button or performed an external action to influence a dynamical system, we do not know for sure whether the intervention succeeded. This corresponds to a case where the intervention targets, I^t, are noisy and tend to have false positives, i.e., I_i^t = 1 although the intervention did not succeed. How sensitive is iCITRIS to such noise?

To investigate this question, we repeat the experiments of the Voronoi benchmark on the random graphs of size 6, but simulate that in 10% of the cases where I_i^t = 1, we actually do not intervene on C_i and instead sample its value from the observational distribution. The results are summarized in Table 8 (left two columns) and clearly show that iCITRIS still works well in this setting. The variables are disentangled almost as well as before. The additional temporal edges are partly also a consequence of the noisy interventions in the post-processing causal discovery step.
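The noisy-target setup of this ablation can be simulated as in the following sketch; `obs_sampler` is a hypothetical helper returning observational samples of a variable, and all names are ours.

```python
import numpy as np

def add_false_positives(targets, values, obs_sampler, p_fail=0.1, seed=0):
    """Simulate failed interventions: with probability p_fail, an intervention
    flagged in `targets` did not actually happen, so the variable's value is
    re-drawn from its observational distribution while the (now noisy) target
    label stays 1. `obs_sampler(i, n)` is an assumed helper returning n
    observational samples of variable i."""
    rng = np.random.default_rng(seed)
    values = values.copy()
    failed = (targets == 1) & (rng.random(targets.shape) < p_fail)
    for i in range(targets.shape[1]):
        idx = np.where(failed[:, i])[0]
        if len(idx):
            values[idx, i] = obs_sampler(i, len(idx))
    return targets, values, failed
```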

G.1.2 ABLATION 2: EMPTY INSTANTANEOUS GRAPH

The main advantage of iCITRIS over the baselines is that it supports instantaneous effects. In practice, however, we might not know whether instantaneous effects are present in the data. Thus, this ablation study investigates whether iCITRIS can still be used as a replacement for baselines like CITRIS when perfect interventions are provided. For this, we repeat the experiments of the Voronoi benchmark on causal models of size 6 with an empty instantaneous graph. As the results in Table 8 show, iCITRIS, CITRIS, and iVAE are all able to identify the causal variables and the graph. This shows that iCITRIS can indeed be used as a replacement for CITRIS and iVAE, even in the setting where the variables are independent conditioned on the previous time step.

G.1.3 ABLATION 3: NO TEMPORAL RELATIONS

As a final ablation study, we consider the most difficult setup, namely having no temporal relations at all. In this case, all relations between causal variables are purely instantaneous, and we cannot use any information of the previous time step, i.e., z^t, as initial guidance for disentangling the variables. Once more, we repeat the experiments of the Voronoi benchmark on the random graphs of size 6, but without any temporal relations, and summarize the results in Table 8 (right columns). Due to the difficulty of the task, none of the methods was able to identify the causal variables. Since the probabilities of the edges in iCITRIS are initially around 0.5, the model focuses on finding K independent factors of variation instead of the causal variables. The balance that is crucial in the temporal setup, namely that the knowledge of the interventions and the previous time step is more important than the instantaneous effects for some variables, does not hold in this case. Hence, to overcome this problem, different optimization strategies than the ones discussed in Section 4.3 are needed. Interestingly, even the autoregressive iVAE fails to go beyond finding K independent factors, underlining the difficulty of the task.

G.2 INSTANTANEOUS TEMPORAL CAUSAL3DIDENT

The full experimental results for the Instantaneous Temporal Causal3DIdent dataset, including standard deviations across three seeds for all models, are shown in Table 9. Next to the correlation and graph prediction metrics, we also list the results of the triplet evaluation, following Lippe et al. (2022b). The triplet distance measures how well we can perform new combinations of causal factors in latent space without causing correlations among different factors.

G.2.1 ABLATION STUDY 4: ORIGINAL TEMPORAL CAUSAL3DIDENT WITHOUT INSTANTANEOUS EFFECT

To verify that iCITRIS can be used as a replacement for CITRIS even in environments without instantaneous effects, we apply iCITRIS to the original Temporal Causal3DIdent dataset. This dataset has the same causal variables and relations, but instead of being instantaneous, all effects act across time steps. The results are shown in Table 10. iCITRIS achieves almost identical results to CITRIS, verifying that iCITRIS generalizes CITRIS. In terms of causal graph prediction, iCITRIS occasionally predicted an instantaneous edge between the object shape and the object rotation, an edge that CITRIS incorrectly predicted over time. This is to be expected due to the visual complexity of the dataset.

G.2.2 ABLATION STUDY 5: PERFECT INTERVENTIONS IN INSTANTANEOUS TEMPORAL CAUSAL3DIDENT

As an ablation on this dataset, we conduct an experiment on the Instantaneous Temporal Causal3DIdent dataset in which all interventions are perfect. The experimental results, including standard deviations across three seeds for all models, are shown in Table 11. We additionally show the discriminator accuracy of distinguishing between true and fake interventional samples. The results indicate that while perfect interventions make the task a bit easier in general, iCITRIS-ENCO still performs best. The small differences in entanglement between iCITRIS-ENCO and iCITRIS-NOTEARS lead to considerable differences when generating new combinations of causal factors, highlighting the importance of strong disentanglement between causal factors. Similarly, the discriminator accuracy shows that iCITRIS-ENCO can accurately model the distribution of the true causal model, while clear differences to the VAE-based baselines, iVAE and iVAE-AR, are visible.

To get an intuition for which graphs the different models identify, we visualize one example per model in Figure 18. In general, we see that iCITRIS-ENCO misses only one edge, which is sparse anyway, since hue_b affects hue_o only for two shapes. Similar to the results of Lippe et al. (2022b), we find that the object shape is a false positive parent of the rotation of the object. For CITRIS, we see that it starts to predict incorrect orientations due to correlations among factors. Finally, iVAE and iVAE-AR predict graphs that have little in common with the ground truth.

To show the importance of the mutual information estimator in iCITRIS, we also experiment with iCITRIS without the MI estimator. The results in Table 11 show that the MI estimator is indeed crucial for reaching iCITRIS's strong performance. Without it, we observe higher correlations between different latent representations and causal variables. In the end, this also leads to a worse graph estimation for both instantaneous and temporal effects.

G.3 CAUSAL PINBALL

Finally, the full experimental results for the Causal Pinball environment can be found in Table 12. Besides the correlation and graph metrics, we again report the triplet evaluation, which shows once more that both iCITRIS and CITRIS work well here. Further, we visualize the predicted causal graphs of the different methods. In general, we found that the most difficult relations are between the paddles and the ball, in particular their orientation. This is due to the deterministic relations between the two factors: if the ball has been hit by the paddle, we can already predict this just from the ball position. Further, in many states the ball and paddle do not affect each other, such that a state where the paddle would have hit the ball, but the ball was intervened upon in the same time step, is extremely rare. Overall, all models suffered from this problem, but iCITRIS proved to handle it well.

