MIXTURE REPRESENTATION LEARNING WITH COUPLED AUTOENCODING AGENTS

Abstract

Jointly identifying a mixture of discrete and continuous factors of variability can help unravel complex phenomena. We study this problem by proposing an unsupervised framework called coupled mixture VAE (cpl-mixVAE), which utilizes multiple interacting autoencoding agents. The individual agents operate on augmented copies of training samples to learn mixture representations, while being encouraged to reach consensus on the categorical assignments. We provide theoretical justification to motivate the use of a multi-agent framework, and formulate it as a variational inference problem. We benchmark our approach on MNIST and dSprites, achieving state-of-the-art categorical assignments while preserving interpretability of the continuous factors. We then demonstrate the utility of this approach in jointly identifying cell types and type-specific, activity-regulated genes for a single-cell gene expression dataset profiling over 100 cortical neuron types.

1. INTRODUCTION

Complex phenomena can often be attributed to a mixture of discrete and continuous factors of variability. Understanding such complexity is crucial in a variety of contexts, from learning models for image datasets to identifying factors underlying neuronal identity. A common approach to studying these phenomena is clustering, which can produce representations that jointly capture the dependence on discrete and continuous factors. Generative models can learn such representations, a problem that has recently received attention from the deep learning community. Deep Gaussian mixture models are among the first deep generative models to jointly represent discrete and continuous factors, in which a continuous representation is decomposed into discrete clusters (Johnson et al., 2016; Dilokthanakul et al., 2016; Jiang et al., 2017). However, such models have mainly focused on clustering without regard to interpretability. Adversarial and variational methods have been proposed to learn mixture representations that can identify interpretable continuous factors. While adversarial learning, e.g., InfoGAN (Chen et al., 2016), is susceptible to stability issues (Kim & Mnih, 2018; Dupont, 2018; Jeong & Song, 2019), variational approaches, e.g., JointVAE and CascadeVAE, have produced promising and more stable results (Dupont, 2018; Jeong & Song, 2019). However, such variational methods, which utilize a single autoencoding agent, rely either on a heuristic data-dependent embedding capacity or on solving a separate optimization problem for the discrete variable. Thus, learning interpretable and stable mixture representations remains challenging. We introduce a multi-agent variational framework to jointly infer discrete and continuous factors through collective decision making, while sidestepping the heuristic approaches used by single-agent frameworks.
Coupling of autoencoding agents has previously been studied in the context of multimodal recordings, where each agent learns a continuous latent representation for one of the data modalities (Feng et al., 2014; Gala et al., 2019). Here, we propose pairwise-coupled autoencoders to learn a mixture representation for a single data modality in an unsupervised fashion. Each autoencoding agent receives an augmented copy of the given sample with the same class label. To achieve this, we design a novel type-preserving augmentation that generates noisy copies of the data using within-class variabilities, while preserving class identity. Coupling across the agents is achieved by encouraging the categorical variables to be invariant under the augmentation, which regularizes the agents to learn interpretable representations. We demonstrate that such a coupled multi-agent architecture can increase inference accuracy and robustness by exploiting within-cluster variabilities, without requiring a prior distribution on the relative abundances of categories. Our contributions can be summarized as follows: (i) We first provide theoretical justification for the advantage of collective decision making, utilizing noisy copies of the same sample, in producing more accurate categorical assignments. To obtain such samples, we propose an unsupervised type-preserving augmentation method. (ii) We formulate collective decision making as a variational inference problem with multiple agents. In this formulation, we introduce an approximation of the Aitchison distance in the simplex to compare the agents' categorical assignments, which avoids mode collapse. (iii) We benchmark our method and display its superiority over comparable approaches on the MNIST and dSprites datasets. (iv) Finally, we apply the method to a challenging single-cell gene expression dataset for a population of neurons.
We demonstrate that our method can be used to discover discrete categories referred to as neuronal types and type-specific genes regulating the continuous within-type variability (e.g., metabolic states, disease states).
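The intuition behind contribution (i) — that agents classifying independently corrupted copies of the same sample can collectively outperform any single agent — can be illustrated with a small Monte Carlo simulation. The binary majority vote below is a deliberate simplification of the paper's multi-agent consensus mechanism; the function name and parameters are illustrative, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_accuracy(p_correct, n_agents, n_trials=100_000):
    """Accuracy of a majority vote among independent agents, each of which
    classifies its own noisy copy of a sample correctly with probability p_correct."""
    correct = rng.random((n_trials, n_agents)) < p_correct
    votes = correct.sum(axis=1)
    wins = votes > n_agents / 2          # strict majority is correct
    ties = votes == n_agents / 2         # ties (even committees) broken at random
    return wins.mean() + 0.5 * ties.mean()

# A single agent that is correct 70% of the time is outperformed by a
# committee of agents voting on independent noisy copies; accuracy rises
# monotonically with the number of agents.
for n in (1, 3, 5, 9):
    print(n, round(consensus_accuracy(0.7, n), 3))
```

This is the classical majority-vote argument: for independent agents with individual accuracy above chance, committee accuracy strictly increases with committee size, which motivates coupling multiple autoencoding agents rather than training one.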

2. RELATED WORK

As introduced above, recent studies on joint learning of discrete and continuous factors in generalized mixture models focus on variational or adversarial approaches (Dupont, 2018; Jeong & Song, 2019; Chen et al., 2016). There is also a rich literature that focuses on clustering in mixture models without attempting to characterize the continuous variability: Dilokthanakul et al. (2016) and Jiang et al. (2017) performed variational inference in mixture models using autoencoding architectures; Tian et al. (2017) applied the alternating direction method of multipliers (ADMM) to use classical clustering algorithms in conjunction with a neural network; Guo et al. (2016) and Locatello et al. (2018b) used gradient boosting approaches (Friedman, 2001) to iteratively fit mixture models in variational frameworks. Moreover, the idea of improving clustering performance by seeking a consensus across similar agents has been explored in both unsupervised (Monti et al., 2003; Kumar & Daumé, 2011) and semi-supervised contexts (Blum & Mitchell, 1998). However, none of these consensus clustering approaches attempts to learn an interpretable continuous variability. Here, in contrast, beyond joint disentangling, we propose a framework in which the agents seek a consensus while learning the mixture representation. While our method does not assume any prior or supervising information, the individual agents in our approach can be considered to provide a form of weak supervision to each other. Bouchacourt et al. (2017) demonstrated a multi-level variational autoencoder, a weakly supervised disentanglement approach in which observations within a group are known to share the same class label; the class label variable takes values from a finite set, and the grouped observations are represented by a Gaussian distribution. Recently, Locatello et al. (2020) improved this framework by assuming that observation pairs share at least one underlying factor, and demonstrated disentangling of continuous (but not discrete) factors on image sets.

3. PRELIMINARIES

For an observation x ∈ R^D, a variational autoencoder (VAE) learns a generative model p_θ(x|z) and a variational distribution q_φ(z|x), where z ∈ R^M for M ≪ D is a latent variable with a parameterized distribution p(z) (Kingma & Welling, 2013). Disentangling different sources of variability into different dimensions of z enables an interpretable selection of latent factors (Higgins et al., 2017; Locatello et al., 2018a). However, in many real-world applications the inherent mixture of continuous and discrete variation is often overlooked. This problem can be addressed within the VAE framework in an unsupervised fashion by introducing a categorical latent variable c ∈ S^K, denoting the class label defined on the K-simplex, alongside the continuous latent variable s ∈ R^M. Here, we refer to the continuous variable s as the state or style variable interchangeably. Assuming s and c are independent random variables, the evidence lower bound (ELBO) (Blei et al., 2017) for a single autoencoding agent with distributions parameterized by θ and φ is given by

$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(s,c|x)}\left[\log p_\theta(x|s,c)\right] - D_{\mathrm{KL}}\left(q_\phi(s|x)\,\|\,p(s)\right) - D_{\mathrm{KL}}\left(q_\phi(c|x)\,\|\,p(c)\right). \quad (1)$$

Maximizing the ELBO in Eq. 1 to jointly learn q(s|x) and q(c|x) is challenging due to the mode collapse problem, in which the network ignores a subset of latent variables. Akin to β-VAE (Higgins et al., 2017; Burgess et al., 2018), JointVAE assigns controlled information capacities to both continuous and categorical factors to prevent mode collapse (Dupont, 2018). A drawback of this method is that the capacities are dataset-dependent and need to be tuned empirically over training iterations. As an alternative, CascadeVAE (Jeong & Song, 2019) maximizes the ELBO by iterating over two separate optimizations for the continuous and categorical variables after a warm-up period, instead of a fully gradient-based optimization.
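The three terms of the ELBO in Eq. 1 can be made concrete with a minimal numerical sketch. The snippet below assumes a unit-variance Gaussian likelihood for the reconstruction term (up to an additive constant), a standard normal prior p(s), a diagonal-Gaussian posterior q(s|x), and a uniform prior p(c) over K categories; the function name and interface are illustrative, not the paper's implementation.

```python
import numpy as np

def elbo_terms(x, x_recon, s_mu, s_logvar, c_probs):
    """Evaluate the ELBO of Eq. 1 for a single autoencoding agent.

    recon : E_q[log p(x|s,c)] under a unit-variance Gaussian likelihood,
            up to an additive constant.
    kl_s  : D_KL(q(s|x) || N(0, I)) in closed form for a diagonal Gaussian
            posterior with mean s_mu and log-variance s_logvar.
    kl_c  : D_KL(q(c|x) || Uniform(K)) for a categorical posterior c_probs.
    """
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    kl_s = -0.5 * np.sum(1.0 + s_logvar - s_mu**2 - np.exp(s_logvar))
    K = c_probs.shape[-1]
    kl_c = np.sum(c_probs * (np.log(c_probs + 1e-8) - np.log(1.0 / K)))
    return recon - kl_s - kl_c  # the bound to be maximized
```

With a perfect reconstruction, a standard normal posterior for s, and a uniform categorical posterior, all three terms vanish and the bound attains its maximum of zero under this likelihood model; any posterior that deviates from the priors, or any reconstruction error, lowers the bound.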
Although the computational cost of the suggested optimization for the categorical variable scales approximately linearly with the number of categories, this step must still be solved separately from the gradient-based updates at each training iteration.