IDENTIFYING COARSE-GRAINED INDEPENDENT CAUSAL MECHANISMS WITH SELF-SUPERVISION

Anonymous

Abstract

Current approaches for learning disentangled representations assume that independent latent variables generate the data through a single data-generation process. In contrast, this manuscript considers independent causal mechanisms (ICM), which, unlike disentangled representations, directly model multiple data-generation processes (mechanisms) at a coarse granularity. In this work, we aim to learn a model that disentangles each mechanism and approximates the ground-truth mechanisms from observational data. We outline sufficient conditions under which the mechanisms can be learned using a single self-supervised generative model with an unconventional mixture prior, simplifying previous methods. Moreover, we prove the identifiability of our model w.r.t. the mechanisms in the self-supervised scenario. We compare our approach to disentangled representations on various downstream tasks, showing that our approach is more robust to intervention, covariate shift, and noise thanks to the disentanglement between the data-generation processes.

1. INTRODUCTION

The past decade has witnessed the great success of machine learning (ML) algorithms, which achieve record-breaking performance in various tasks. However, most of these successes are based on discovering statistical regularities encoded in the data, rather than causal structure. As a consequence, standard ML model performance may degrade significantly under minor changes to the data, such as color changes that are irrelevant to the task but which affect the statistical associations. Human intelligence, on the other hand, is far more robust against such changes (Szegedy et al., 2013). For example, once a baby learns to recognize a digit, the baby can recognize it regardless of color, brightness, or even some style changes. Arguably, this is because human intelligence relies on causal mechanisms (Schölkopf et al., 2012; Peters et al., 2017) that make sense beyond a particular entailed data distribution (Parascandolo et al., 2018). The independent causal mechanisms (ICM) principle (Schölkopf et al., 2012; Peters et al., 2017) assumes that the data-generating process is composed of independent and autonomous modules that do not inform or influence each other. The promising capability of causal mechanisms has fostered an active subfield (Parascandolo et al., 2018; Locatello et al., 2018a; b; Bengio et al., 2019). Recent works define the mechanisms to be: 1) functions that generate a variable from its cause (Bengio et al., 2019), 2) functions that transform the data (e.g. rotation) (Parascandolo et al., 2018), and 3) a disentangled mixture of independent generative models that generate data from distinct causes (Locatello et al., 2018a; b). Throughout this paper, we refer to type 2) mechanisms as shared mechanisms and type 3) mechanisms as generative mechanisms. Despite the recent progress, unsupervised learning of the generative and shared mechanisms from complex observational data (e.g. images) remains a difficult and unsolved task.
In particular, previous approaches (Locatello et al., 2018a; b) for disentangling the generative mechanisms rely on competitive training, which does not directly enforce the disentanglement between generative mechanisms, and their empirical results show entanglement. Additionally, Parascandolo et al. (2018) proposed a mixture-of-experts-based method to learn the shared mechanisms using a canonical distribution and a reference distribution, which contains the transformed data from the canonical distribution. Such a reference distribution is generally unavailable in real-world datasets: to create one, we would need the very shared mechanisms that we aim to learn, a chicken-and-egg problem. Besides, unsupervised learning of deep generative models is proven to be unidentifiable (Locatello et al., 2019; Khemakhem et al., 2020), and lacking identifiability makes it impossible to learn the right disentangled model (Locatello et al., 2019). Recent methods (Locatello et al., 2020; Khemakhem et al., 2020) leverage weak supervision or auxiliary variables to identify the right deep generative model. However, such weak supervision or auxiliary variables still do not exist in conventional datasets (e.g. MNIST). We therefore seek a practical algorithm with an identifiability result that disentangles the mechanisms from i.i.d. data without manual supervision. To this end, we propose a single self-supervised generative model with an unconventional mixture prior. In the following sections, we refer to our model as the ICM model. Using a single self-supervised generative model allows us to leverage the recent progress in deep generative clustering (Mukherjee et al., 2019), which enforces the disentanglement between the generative mechanisms. We use the following example to illustrate the relationship between the generative model and the mechanisms.
Let us assume we have a generative model G : Z → X, two generative mechanisms M_0 : Z_{M_0} → X_{M_0} and M_1 : Z_{M_1} → X_{M_1}, and one shared mechanism M_S : X_M × Z_S → X, where Z = [Z_{M_0}, Z_{M_1}, Z_S] and X_M = X_{M_0} ∪ X_{M_1}. We have G([z_{M_0}, 0, z_S]) = M_S(M_0(z_{M_0}), z_S) and G([0, z_{M_1}, z_S]) = M_S(M_1(z_{M_1}), z_S). Our mixture prior is unconventional because the mixture components are {[N(0, I), 0, N(0, I)], [0, N(0, I), N(0, I)]} instead of {N(µ_0, σ²_0), N(µ_1, σ²_1)}. To keep the notation clear, we omit the normalization factor. We disentangle the generative mechanisms by disentangling the type of variation (cause) carried by each z_{M_k}, ∀k ∈ {0, 1, ..., N−1}, where N is the number of generative mechanisms. The disentanglement between the generative mechanisms and the shared mechanism is guaranteed by the prior itself. Furthermore, we theoretically prove that the ICM model is identifiable w.r.t. the mechanisms without accessing any label.

The key contributions of this paper are:
• We propose a simpler method to learn the mechanisms with only self-supervision.
• We design an unconventional mixture prior that enforces disentanglement.
• We prove the first identifiability result w.r.t. the mechanisms in the self-supervised scenario.
• We develop a novel method to quantitatively evaluate the robustness of ML models under covariate shift, using the covariates that are naturally encoded in the data.
• We conduct extensive experiments to show that our ICM model is more robust against intervention, covariate shift, and noise compared to disentangled representations.
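As a concrete illustration of the example above, sampling from the unconventional mixture prior amounts to activating exactly one mechanism slot z_{M_k} with Gaussian noise, zeroing the others, and always drawing the shared slot z_S from N(0, I). The following is a minimal NumPy sketch under assumed dimensions; the function name and sizes are illustrative, not the paper's implementation:

```python
import numpy as np

def sample_mixture_prior(n_mech=2, dim_m=4, dim_s=4, rng=None):
    """Sample z = [z_M0, ..., z_M{N-1}, z_S] from the unconventional
    mixture prior: one randomly chosen mechanism slot gets N(0, I),
    every other mechanism slot is zeroed, and the shared slot z_S
    is always drawn from N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(rng.integers(n_mech))            # which generative mechanism is active
    z_mech = np.zeros((n_mech, dim_m))
    z_mech[k] = rng.standard_normal(dim_m)   # z_Mk ~ N(0, I)
    z_shared = rng.standard_normal(dim_s)    # z_S ~ N(0, I), present in every component
    return np.concatenate([z_mech.ravel(), z_shared]), k

z, k = sample_mixture_prior()
```

Feeding such a z to the decoder G realizes G([z_{M_0}, 0, z_S]) or G([0, z_{M_1}, z_S]), so the prior itself separates the generative slots from the shared slot.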

2. RELATED WORK

Functional Causal Model In a functional causal model (FCM), the relationships between variables are expressed through deterministic functional equations: x_i = f_i(pa_i, u_i), i = 1, ..., N. Uncertainty in an FCM is introduced via the assumption that the variables u_i, i = 1, ..., N, are not observed (Pearl et al., 2000). If each function in an FCM represents an autonomous mechanism, the FCM is called a structural model. Moreover, if each mechanism determines the value of one and only one variable, the model is called a structural causal model (SCM). SCMs form the basis for many statistical methods (Mooij & Heskes, 2013; Mooij et al., 2016) that aim at inferring knowledge of the underlying causal structure from data (Bongers et al., 2016). Taking the SCM perspective, we want to learn a mixture of causal models whose inputs are purely latent variables and whose output is a single high-dimensional variable that describes complex data such as images. Unlike other SCM approaches, where the unobserved variables only introduce uncertainty to the model, the latent variables in our model carry distinct variations of the dataset.

Independent Component Analysis Discovering independent components of the data-generating process has been studied intensively (Hyvärinen & Oja, 2000; Hyvarinen et al., 2019). A recent work (Khemakhem et al., 2020) bridges the gap between nonlinear independent component analysis (ICA) and deep generative models: nonlinear ICA with auxiliary variables brings parameter-space identifiability to variational auto-encoders. However, parameter-space identifiability does not guarantee the disentanglement between causes. We discuss the difference in Section 4.
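The FCM equations x_i = f_i(pa_i, u_i) can be made concrete with a toy two-variable SCM. The sketch below is purely illustrative (the specific functions f1, f2 are assumptions, not from the paper); each assignment is an autonomous mechanism, so an intervention such as do(x1 = c) replaces only f1 while leaving f2 untouched:

```python
import numpy as np

# Toy SCM with two observed variables:
#   x1 = f1(u1)        (no parents)
#   x2 = f2(x1, u2)    (parent: x1)
# The exogenous noises u1, u2 are never observed.

def f1(u1):
    return u1

def f2(x1, u2):
    return 2.0 * x1 + u2

def sample_scm(n, rng):
    """Draw n i.i.d. samples from the observational distribution of the SCM."""
    u1 = rng.standard_normal(n)
    u2 = rng.standard_normal(n)
    x1 = f1(u1)
    x2 = f2(x1, u2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = sample_scm(10_000, rng)
```

In this toy model the mechanisms only inject noise; in our setting, by contrast, the latent inputs themselves carry the distinct variations of the data.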
Disentangled Representations Disentangled representations assume that the data is generated using a set of independent latent explanatory factors (Bengio et al., 2013) . Previous works (Higgins

