INFERRING PRINCIPAL COMPONENTS IN THE SIMPLEX WITH MULTINOMIAL VARIATIONAL AUTOENCODERS

Abstract

Covariance estimation on high-dimensional data is a central challenge across multiple scientific disciplines. Sparse high-dimensional count data, frequently encountered in biological applications such as DNA sequencing and proteomics, are often well modeled using multinomial logistic normal models. In many cases, these datasets are also compositional, presented item-wise as fractions of a normalized total, due to measurement and instrument constraints. In compositional settings, three key factors limit the ability of these models to estimate covariance: (1) the computational complexity of inverting high-dimensional covariance matrices, (2) the non-exchangeability introduced from the summation constraint on multinomial parameters, and (3) the irreducibility of the multinomial logistic normal distribution that necessitates the use of parameter augmentation, or similar techniques, during inference. Using real and synthetic data, we show that a variational autoencoder augmented with a fast isometric log-ratio (ILR) transform can address these issues and accurately estimate principal components from multinomial logistic normal distributed data. This model can be optimized on GPUs and modified to handle mini-batching, with the ability to scale across thousands of dimensions and thousands of samples.

2. INTRODUCTION

Many scientific disciplines that collect survey data, such as economics, psychology, political science, and the biological sciences, routinely deal with compositional data, where only relative information can be measured. These datasets are often in the form of counts, where the total count within a sample is only indicative of the confidence of the measured proportions. The resulting proportions lie within a simplex, and failing to account for the structure of this simplicial sample space can confound the interpretation of the measurements. As a result, there has been wide discussion across disparate disciplines (1; 2; 3; 4) concerning the reproducibility crisis that has arisen from the misinterpretation of compositional data. One obstacle to the appropriate analysis of compositional data is the difficulty of efficiently estimating the latent parameters that lie in the simplex. Accurately scaling probabilistic inference across high-dimensional count data is a major outstanding challenge (5). This problem is apparent in the social sciences and is particularly pronounced in biological fields, where datasets can contain observations on tens of thousands of features across hundreds to millions of samples. One major computational bottleneck with Gaussian distributed data is the inversion of a d-dimensional covariance matrix, which has a runtime of O(d³) (6; 7). As a result, probabilistic covariance estimation for high-dimensional data is a computationally challenging problem. Recent theoretical developments (8) cementing the connection between Variational Autoencoders (VAEs) (9) and Probabilistic Principal Components Analysis (PPCA) (10) hold much promise for enabling accurate, scalable, low-rank approximations of large covariance matrices.
Variational autoencoders were originally proposed as a generative model (9), but are now commonly deployed across scientific disciplines and have made contributions to single-cell RNA sequencing (11), microbiome modeling (12), protein modeling (13; 14; 15), natural language processing (16) and image processing (9). Following insights that connected regularized linear autoencoders and PCA (17), Lucas et al. (8) showed that carefully designed VAEs can recover the weights that are solved by PPCA. A computational advantage of VAEs is that they do not require the inversion of a covariance matrix; the resulting runtime is O(ndkT) for n samples, d dimensions, k latent dimensions and T epochs. While it has been noted that VAEs may take tens of thousands of epochs to estimate the principal components (18), VAEs are easily parallelizable and can be accelerated with GPUs, presenting an attractive alternative for estimating principal components (17) and the resulting covariance matrix. The connection between VAEs and PPCA is currently limited to Gaussian distributed data and is not well suited to a compositional setting. Showing that VAEs can recover the correct principal components from count data is nontrivial due to the non-conjugacy between the logistic normal distribution and count distributions such as the multinomial distribution. Furthermore, the parameters of the multinomial distribution are compositional; they are constrained within the simplex, and the resulting covariance matrix is singular and non-invertible (1; 19). Aitchison (20) showed that PCA can be adapted to compositional data through the center log-ratio (CLR) transform, which maintains isometry. However, this transformation is not isomorphic: the resulting log-ratios are constrained to sum to zero, so CLR-transformed data produce a singular covariance matrix and rank-deficient principal components.
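The rank deficiency of CLR-transformed data can be seen directly: because CLR coordinates sum to zero within each sample, their covariance matrix has rank at most d − 1. The following NumPy sketch illustrates this; the Dirichlet-sampled compositions and sizes are illustrative assumptions, not data from this work.

```python
import numpy as np

def clr(p):
    """Center log-ratio transform: log(p) minus its per-sample mean."""
    logp = np.log(p)
    return logp - logp.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Illustrative data: 100 compositions on the 5-part simplex.
x = rng.dirichlet(np.ones(5), size=100)
y = clr(x)

# Each row of y sums to zero, so the 5 x 5 covariance matrix is
# singular, with rank at most d - 1 = 4.
cov = np.cov(y, rowvar=False)
print(np.linalg.matrix_rank(cov))
```

The singular covariance is exactly what prevents full-rank principal component estimation on CLR coordinates.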
It has been shown that the isometric log-ratio (ILR) transform (21) satisfies both isomorphism and isometry and can handle this singularity issue (22; 23) while enabling the estimation of full-rank principal components. Here, we show that VAEs augmented with the ILR transform can infer the principal components learned from PPCA on multinomially distributed data, beginning to address these critical shortcomings.
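For contrast with CLR, a minimal ILR sketch: projecting the CLR coordinates onto an orthonormal basis of the zero-sum hyperplane yields d − 1 coordinates whose covariance matrix is full rank. The QR-based basis construction below is one convenient choice of contrast matrix, an illustrative assumption rather than necessarily the basis used in this work.

```python
import numpy as np

def ilr_basis(d):
    """One orthonormal basis of the zero-sum (CLR) hyperplane, built by QR.

    Any orthonormal contrast matrix works; this construction is an
    illustrative choice, not the specific basis used in the paper.
    """
    first = np.ones((d, 1)) / np.sqrt(d)
    q, _ = np.linalg.qr(np.hstack([first, np.eye(d)[:, : d - 1]]))
    return q[:, 1:].T  # shape (d - 1, d); rows are orthonormal, zero-sum

def ilr(p, V):
    """Isometric log-ratio transform: project CLR coordinates onto V."""
    logp = np.log(p)
    clr_coords = logp - logp.mean(axis=1, keepdims=True)
    return clr_coords @ V.T

rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(5), size=100)  # compositions in the 5-part simplex
V = ilr_basis(5)
y = ilr(x, V)                            # (100, 4) unconstrained coordinates

cov = np.cov(y, rowvar=False)            # (d-1) x (d-1) and invertible
print(np.linalg.matrix_rank(cov))
```

Because the ILR coordinates are unconstrained, the resulting covariance matrix is invertible and full-rank principal components can be estimated.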

3. RELATED WORK

In the microbiome literature, a number of methods (24; 25; 26; 27; 28) have attempted to model ecological networks through the estimation of pairwise microbe correlations or pairwise inverse covariance, where microbes are aggregated at different taxonomic scales, or 'taxa'. Of these tools, only FlashWeave can scale beyond a few thousand taxa; however, it does so by avoiding estimation of the covariance matrix altogether. Methods that do attempt to estimate the covariance matrix can only handle on the order of a few thousand dimensions. Although there is no widely accepted consensus definition of Multinomial PPCA in this context, being able to efficiently estimate its parameters would be highly useful for exploratory biological analysis. A number of studies have proposed mixture modeling as a proxy for PCA (29; 30; 31); however, these techniques depend either on the Dirichlet distribution, whose covariance matrix is not flexible, or on stick-breaking, which violates permutation invariance (32). Lucas et al. (8) previously showed that the following two models obtain the same maximum likelihood estimates of the principal components W:

Probabilistic PCA:

    p(x | z) = N(Wz + µ, σ²I_d),    p(z) = N(0, I_k)

Linear VAE:

    p(x | z) = N(Wz + µ, σ²I_d),    q(z | x) = N(V(x − µ), D)

Here, p(x|z) denotes the likelihood of observations x ∈ R^d given the latent representation z ∈ R^k, p(z) denotes the prior on z, and q(z|x) denotes the estimated variational posterior distribution of z given an encoder parameterized by V and diagonal variances D. Both models estimate the same low-dimensional representation of the data through z and learn the same factorization of the covariance matrix through W. While PPCA parameters are typically estimated through expectation maximization (10), linear VAEs are optimized by maximizing the Evidence Lower Bound (ELBO):

    log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z))

For linear VAEs with a Gaussian likelihood, the variational posterior q(z|x) can be shown to agree analytically with the posterior p(z|x) learned from PPCA (8). However, deriving this connection for count-based likelihoods such as the multinomial distribution is complicated by non-conjugacy issues (Appendix A). This is a major obstacle for many biological applications; multiple works have shown the merits of incorporating count distributions explicitly into the model (33; 34; 35; 36). Here, we provide directions for overcoming this issue.
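The linear-VAE objective above can be sketched numerically. The NumPy snippet below computes a Monte Carlo estimate of the ELBO for the linear VAE with Gaussian likelihood, encoder q(z|x) = N(V(x − µ), D), and the closed-form KL term for diagonal Gaussians. The sizes and randomly initialized parameters are illustrative assumptions; in practice W, V, and D would be fit by gradient ascent on the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem sizes (illustrative assumptions).
n, d, k = 200, 10, 3
X = rng.normal(size=(n, d))

# Randomly initialized parameters; a real run would train these.
W = rng.normal(size=(d, k)) * 0.1   # decoder weights
mu = X.mean(axis=0)                 # decoder offset
V = rng.normal(size=(k, d)) * 0.1   # encoder weights
log_D = np.zeros(k)                 # log diagonal posterior variances
sigma2 = 1.0                        # observation noise variance

def elbo(X):
    # Encoder: q(z|x) = N(V(x - mu), D) with diagonal D.
    m = (X - mu) @ V.T
    D = np.exp(log_D)
    # Reparameterization trick: z = m + sqrt(D) * eps, eps ~ N(0, I).
    eps = rng.normal(size=m.shape)
    z = m + np.sqrt(D) * eps
    # Decoder log-likelihood for p(x|z) = N(Wz + mu, sigma^2 I_d).
    recon = z @ W.T + mu
    log_lik = -0.5 * (((X - recon) ** 2).sum(axis=1) / sigma2
                      + d * np.log(2 * np.pi * sigma2))
    # KL(q(z|x) || N(0, I_k)): closed form for diagonal Gaussians.
    kl = 0.5 * ((m ** 2).sum(axis=1) + D.sum() - k - log_D.sum())
    return (log_lik - kl).mean()

print(elbo(X))
```

Maximizing this quantity over W, V, and D (e.g., with stochastic gradients on mini-batches) is what recovers the PPCA solution in the Gaussian case.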

4. METHODS

First, we will redefine Multinomial PPCA with the ILR transform (21). Then we will make the connection between Multinomial VAEs and Multinomial PPCA by leveraging insights from the

