UNSUPERVISED LEARNING OF CAUSAL RELATIONSHIPS FROM UNSTRUCTURED DATA

Anonymous authors
Paper under double-blind review

Abstract

Endowing deep neural networks with the ability to reason about cause and effect would be an important step towards making them more robust and interpretable. In this work we propose a variational framework that allows deep networks to learn latent variables and their causal relationships from unstructured data, with no supervision or labeled interventions. Starting from an abstract Structural Equation Model (SEM), we show that maximizing its posterior probability yields a construction similar to a Variational Auto-Encoder (VAE), but with a structured prior coupled by non-linear equations. This prior represents an interpretable SEM with learnable parameters (such as a physical model or dependence structure), which can be fitted to data while simultaneously learning the latent variables. Unfortunately, computing KL-divergences with this non-linear prior is intractable. We show how linearizing arbitrary SEMs via back-propagation produces local non-isotropic Gaussian priors, for which the KL-divergences can be computed efficiently and differentiably. We propose two versions: one for IID data (such as images), which detects related causal variables within a sample, and one for non-IID data (such as video), which detects variables that are also related over time. Our proposal is complementary to causal discovery techniques, which assume given variables; it instead discovers both the variables and their causal relationships. We experiment with recovering causal models from images, and with learning temporal relations based on the Super Mario Bros videogame.

1. INTRODUCTION

Human reasoning and decision-making are often underpinned by cause and effect: we take actions to achieve a desired effect, or reason that events would have happened differently had we acted a certain way, or had conditions been different. Scientific inquiry uses the same tools, albeit more formalized, to build knowledge about the world and how our society can affect it (Popper, 1962). When building algorithms that automatically construct statistical models of the world, as is common in machine learning practice, it is therefore desirable to imbue them with similar inductive priors about cause and effect (Glymour et al., 2016). In addition to being more robust than statistical models that only characterize the observational distribution (Peters et al., 2017), such models would allow reasoning about changing conditions outside the observed distribution (e.g. counterfactual reasoning). They would also communicate their inner workings more effectively, allowing us to ask "why" a given conclusion was reached, much as we do in scientific communication.

Although still an active research area, there is now a mature body of work on understanding whether two or more variables are related as cause and effect (Peters et al., 2017). Many techniques assume that the variables are given, and concern themselves with finding relationships between them (Spirtes & Glymour, 1991; Chickering, 2003; Lorch et al., 2021). On the other hand, an advantage of modern deep neural networks is that they learn intermediate representations that do not have to be manually labeled (Yosinski et al., 2015), and effective models can be trained without supervision (Kingma & Welling, 2014). An important question then arises: can a deep network simultaneously discover latent variables in the data and establish cause-effect relationships between them?

We focus on learning Additive Noise Models (ANM) with Gaussian noise, which are identifiable (i.e. causal directions are distinguishable) as long as the functions relating the variables of interest are not linear (Hoyer et al., 2008). This model fits a variational learning framework well, and so we are able to derive an analogue of a Variational Auto-Encoder (VAE) (Kingma & Welling, 2014) where the prior, rather than being an uninformative Gaussian, corresponds exactly to the ANM. When the ANM is linear with Gaussian noise, the joint distribution of the variables is also Gaussian, and variational inference is straightforward; the dependencies between variables are then expressed in the sparsity structure of the covariance matrix. However, as mentioned earlier, the model cannot be linear if the causal directions are to be identifiable (Hoyer et al., 2008). We resolve this difficulty by learning models that are locally linear but globally non-linear. This approach affords the full generality of a non-linear ANM with the simplicity of variational inference on Gaussian models.

In summary, our contributions are:

• A rigorous derivation of the variational Evidence Lower Bound (ELBO) of an Additive Noise Model (ANM), allowing efficient inference of Structural Equation Models (SEM) with deep networks.
• A linearization method leveraging automatic differentiation to construct a local Gaussian approximation of arbitrary non-linear ANMs.
• A temporally-aware specialization of the causal ANM that encodes the causal directions implicit in the arrow of time and is suitable for high-dimensional time-series data such as video.
• Experiments demonstrating that the proposed method fits latent variables with a dependence structure in high-dimensional data, namely a synthetic image dataset and video-game-based data.
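To make the linearization idea concrete, the following minimal sketch linearizes a toy two-variable non-linear ANM around a point and evaluates the closed-form Gaussian KL-divergence against the resulting local non-isotropic prior. All mechanisms, noise scales, and function names here are illustrative assumptions, not the paper's actual model, and central finite differences stand in for back-propagation:

```python
import numpy as np

def f(z1):
    """Assumed nonlinear mechanism in the toy ANM: z2 = f(z1) + eps."""
    return np.tanh(2.0 * z1)

def local_gaussian_prior(z1_0, s1=1.0, s2=0.5, h=1e-5):
    """Linearize the SEM around z1_0 (finite differences standing in for
    back-propagation) to obtain a local Gaussian over (z1, z2)."""
    a = (f(z1_0 + h) - f(z1_0 - h)) / (2 * h)  # local slope df/dz1
    mean = np.array([z1_0, f(z1_0)])
    # Covariance of [z1, z2] under z2 ≈ f(z1_0) + a*(z1 - z1_0) + eps:
    cov = np.array([[s1**2,           a * s1**2],
                    [a * s1**2, a**2 * s1**2 + s2**2]])
    return mean, cov

def gaussian_kl(m0, S0, m1, S1):
    """Closed-form KL( N(m0,S0) || N(m1,S1) ); each term is differentiable,
    which is what makes this usable inside a variational objective."""
    k = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d
                  - k + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mean, cov = local_gaussian_prior(z1_0=0.3)
# KL between an isotropic approximate posterior and the structured local prior:
kl = gaussian_kl(mean, 0.1 * np.eye(2), mean, cov)
print(kl)  # a finite, non-negative scalar
```

Note that the off-diagonal entries of `cov` encode the (linearized) dependence between the two variables; an independent prior would zero them out, which is exactly the structure an isotropic-prior VAE cannot express.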

2. RELATED WORK

Our work lies at the intersection of causality, variational inference, representation learning, and high-dimensional unstructured input domains.

Causal inference deals with determining causes and effects from data. Causal discovery methods generally focus on recovering the causal graph responsible for generating the observed data, e.g. Spirtes & Glymour (1991); Chickering (2003) (for an overview of methods see Peters et al. (2017)). However, these methods are largely applied to structured datasets such as medical (Brooks-Gunn et al., 1992; Sachs et al., 2005; Louizos et al., 2017) or economics data (LaLonde, 1986), where the observed variables are provided by domain specialists. In contrast, we focus on unstructured data where the variables are not provided a priori.

Variational inference performs inference by solving an optimisation problem. A popular instance is the Variational Auto-Encoder (VAE) (Kingma & Welling, 2014), which aims to extract a useful latent representation of the data by encoding it and decoding it back. Traditionally the VAE prior is assumed to be an isotropic Gaussian distribution and the aim is to extract independent latent variables, as in the β-VAE (Higgins et al., 2016b) and FactorVAE (Kim & Mnih, 2018). Other works use hierarchical priors, such as iteratively conditioning each variable on its preceding variable in the Ladder-VAE (Sønderby et al., 2016), or conditioning each variable on all of its predecessors in NVAE (Vahdat & Kautz, 2020) and VDVAE (Child, 2021). We also use a prior that conditions each variable on its predecessors, but this arises as a natural consequence of basing our prior on a structural equation model (SEM).

Recently there has been growing interest in representation learning based on causal principles. For instance, the CausalVAE (Yang et al., 2021) learns independent latent variables which are then composed to form causal relationships; however, it only considers linear relationships between variables. Other works use alternatives to VAEs for causal learning, such as the CausalGAN (Kocaoglu et al., 2018), which uses generative adversarial networks. Yet another line of work focuses on modelling object dynamics from video, such as Li et al. (2020), but relies on specialised modules for keypoint detection and future prediction. Graph neural networks have been used to infer an interaction graph (Kipf et al., 2018; Löwe et al., 2022), but these methods do not deal with image or video data. Lippe et al. (2022) focus on causal learning with knowledge of interventions, whereas we assume no such knowledge. Another line of work, such as Lachapelle et al. (2022) and the iVAE (Khemakhem et al., 2020), builds on non-linear Independent Component Analysis theory. Locatello et al. (2020) explore using a small number of labeled examples for learning. Walker et al. (2021) use a VQ-VAE with a hierarchical prior for video future prediction, but do not focus on causal relationships.
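As a minimal illustration of why an SEM naturally induces a predecessor-conditioned prior, consider a hypothetical three-variable additive-noise SEM (the sin and product mechanisms below are assumptions chosen purely for illustration). Sampling it in topological order realizes exactly the ancestral factorization p(z1) p(z2|z1) p(z3|z1,z2) that hierarchical-prior VAEs parameterize:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sem(n, noise_std=0.1):
    """Hypothetical SEM over three latents, each an additive-noise
    function of its predecessors (topological order z1 -> z2 -> z3)."""
    z1 = rng.normal(0.0, 1.0, n)                      # z1 := eps1
    z2 = np.sin(z1) + rng.normal(0.0, noise_std, n)   # z2 := f2(z1) + eps2
    z3 = z1 * z2 + rng.normal(0.0, noise_std, n)      # z3 := f3(z1, z2) + eps3
    return np.stack([z1, z2, z3], axis=1)

# The sampling procedure above IS the factorization
# p(z) = p(z1) p(z2 | z1) p(z3 | z1, z2):
z = sample_sem(10_000)
print(np.corrcoef(z.T).round(2))
```

The dependence structure shows up in the sample correlations, whereas an isotropic-Gaussian prior would force all off-diagonal entries towards zero.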

