INFERRING PRINCIPAL COMPONENTS IN THE SIMPLEX WITH MULTINOMIAL VARIATIONAL AUTOENCODERS

Abstract

Covariance estimation on high-dimensional data is a central challenge across multiple scientific disciplines. Sparse high-dimensional count data, frequently encountered in biological applications such as DNA sequencing and proteomics, are often well modeled using multinomial logistic normal models. In many cases, these datasets are also compositional, presented item-wise as fractions of a normalized total, due to measurement and instrument constraints. In compositional settings, three key factors limit the ability of these models to estimate covariance: (1) the computational complexity of inverting high-dimensional covariance matrices, (2) the non-exchangeability introduced from the summation constraint on multinomial parameters, and (3) the irreducibility of the multinomial logistic normal distribution that necessitates the use of parameter augmentation, or similar techniques, during inference. Using real and synthetic data we show that a variational autoencoder augmented with a fast isometric log-ratio (ILR) transform can address these issues and accurately estimate principal components from multinomially logistic normal distributed data. This model can be optimized on GPUs and modified to handle mini-batching, with the ability to scale across thousands of dimensions and thousands of samples.

2. INTRODUCTION

Many scientific disciplines that collect survey data, such as economics, psychology, political science and the biological sciences, routinely deal with compositional data, where only relative information can be measured. These datasets are often in the form of counts, where the total count within a sample is only indicative of the confidence of the measured proportions. The resulting proportions lie within a simplex, and failing to account for the structure of this simplicial sample space can confound the interpretation of the measurements. As a result, there has been wide discussion across disparate disciplines (1; 2; 3; 4) concerning the reproducibility crisis that has arisen from the misinterpretation of compositional data. One of the obstacles to the appropriate analysis of compositional data is the difficulty of efficiently estimating the latent parameters that lie in the simplex. Accurately scaling probabilistic inference across high-dimensional count data is a major outstanding challenge (5). This problem is apparent in the social sciences and is particularly pronounced in biological fields, where datasets can contain observations on tens of thousands of features across hundreds to millions of samples. One major computational bottleneck with Gaussian distributed data is the inversion of a d-dimensional covariance matrix, which has a runtime of O(d^3) (6; 7). As a result, probabilistic covariance estimation for high-dimensional data is a computationally challenging problem. Recent theoretical developments (8) cementing the connection between Variational Autoencoders (VAEs) (9) and Probabilistic Principal Components Analysis (PPCA) (10) hold much promise for enabling accurate, scalable, low-rank approximations of large covariance matrices.
Variational autoencoders were originally proposed as a generative model (9), but are now commonly deployed across scientific disciplines and have made contributions to single-cell RNA sequencing (11), microbiome modeling (12), protein modeling (13; 14; 15), natural language processing (16) and image processing (9). Following insights that connected regularized linear autoencoders and PCA (17), Lucas et al. (8) showed that carefully designed VAEs can recover the weights that are solved by PPCA. A computational advantage of VAEs is that they do not require the inversion of a covariance matrix; the resulting runtime is O(ndkT) for n samples, d dimensions, k latent dimensions and T epochs. While it has been noted that VAEs may take tens of thousands of epochs to estimate the principal components (18), VAEs are easily parallelizable and can be accelerated with GPUs, presenting an attractive alternative for estimating principal components (17) and the resulting covariance matrix. The connection between VAEs and PPCA is currently limited to Gaussian distributed data and is not well-suited to a compositional setting. Showing that VAEs can recover the correct principal components from count data is nontrivial due to the non-conjugacy between the logistic normal distribution and count distributions such as the multinomial distribution. Furthermore, the parameters of the multinomial distribution are compositional; they are constrained within the simplex, and the resulting covariance matrix is singular and non-invertible (1; 19). Aitchison (20) showed that PCA can be adapted to compositional data through the use of the center log-ratio (CLR) transform, which maintains isometry. However, this transformation is not isomorphic, requiring that the resulting log-ratios sum to zero; as a result, CLR-transformed data will produce a singular covariance matrix and rank-deficient principal components.
It has been shown that the isometric log-ratio (ILR) transform (21) satisfies both isomorphism and isometry and can handle this singularity issue (22; 23) while enabling the estimation of full-rank principal components. Here, we show that VAEs augmented with the ILR transform can infer principal components learned from PPCA on multinomially distributed data, beginning to address these critical shortcomings.

3. RELATED WORK

In the microbiome literature, there have been a number of methods (24; 25; 26; 27; 28) that attempt to model ecological networks through the estimation of pairwise microbe correlations or pairwise inverse covariance, where microbes are aggregated at different taxonomic scales or 'taxa'. Of these tools, only Flashweave can scale beyond thousands of taxa; however, it does so by avoiding the estimation of the covariance matrix. Methods that attempt to estimate the covariance matrix can only handle on the order of a few thousand dimensions. Although there is no widely accepted consensus definition of Multinomial PPCA in this context, being able to efficiently estimate the parameters of Multinomial PPCA would be highly useful for exploratory biological analysis. A number of studies have proposed using mixture modeling as a proxy for PCA (29; 30; 31); however, these techniques depend either on the Dirichlet distribution, whose covariance matrix is not flexible, or on stick-breaking, which violates permutation invariance (32). Lucas et al. (8) have previously shown that the following two models obtain the same maximum likelihood estimates of the principal components W:

Probabilistic PCA:    p(x|z) = N(Wz + µ, σ^2 I_d),    p(z) = N(0, I_k)
Linear VAE:           p(x|z) = N(Wz + µ, σ^2 I_d),    q(z|x) = N(V(x − µ), D)

Here, p(x|z) denotes the likelihood of observations x ∈ R^d given the latent representation z ∈ R^k, p(z) denotes the prior on z and q(z|x) denotes the estimated variational posterior distribution of z given an encoder parameterized by V and diagonal variances D. Both models estimate the same low-dimensional representation of the data through z, and learn the same factorization of the covariance matrix through W.
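To make the PPCA side of this equivalence concrete, the sketch below (a minimal NumPy example with randomly generated data and illustrative dimensions, not code from the paper) computes the Tipping–Bishop closed-form maximum likelihood loadings from the sample covariance and checks that they span the same subspace as the true W:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 10, 3

# Sample from the PPCA generative model: x = W z + noise
W_true = rng.normal(size=(d, k))
sigma2 = 0.01
Z = rng.normal(size=(n, k))
X = Z @ W_true.T + rng.normal(scale=np.sqrt(sigma2), size=(n, d))

# Tipping & Bishop closed-form MLE: W_ML = U_k (L_k - sigma2 I)^{1/2}
S = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]     # sort eigenpairs descending
sigma2_ml = evals[k:].mean()                   # noise variance from trailing eigenvalues
W_ml = evecs[:, :k] * np.sqrt(evals[:k] - sigma2_ml)

# W is only identifiable up to rotation, so compare the spanned subspaces
P_true = np.linalg.qr(W_true)[0]
P_ml = np.linalg.qr(W_ml)[0]
subspace_err = np.linalg.norm(P_true @ P_true.T - P_ml @ P_ml.T)
print(subspace_err < 0.1)   # True: the PPCA subspace is recovered
```

The rotation-invariant comparison of projectors mirrors the subspace-distance metric used in the benchmarks later in the paper.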
While PPCA parameters are typically estimated through expectation maximization (10), linear VAEs are optimized by maximizing the Evidence Lower Bound (ELBO), given by

log p(x) ≥ E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )

For linear VAEs with a Gaussian likelihood, the variational posterior distribution q(z|x) can be shown to analytically agree with the posterior distribution p(z|x) learned from PPCA (8). However, deriving this connection for count-based likelihoods such as the multinomial distribution is complicated by non-conjugacy issues (Appendix A). This is a major obstacle for many biological applications; multiple works have shown the merits of incorporating count distributions explicitly into the model (33; 34; 35; 36). Here, we provide directions for overcoming this issue.

4. METHODS

First, we will redefine Multinomial PPCA with the ILR transform (21) . Then we will make the connection between Multinomial VAEs and Multinomial PPCA by leveraging insights from the Collapse-Uncollapse (CU) sampler (33) . We will then derive an algorithm to obtain the maximum a posteriori (MAP) estimate for the VAE parameters.

4.1. PROBABILISTIC MULTINOMIAL PCA

PPCA can be extended to multinomially distributed data with the following generative model:

p(x|η) = Mult(φ(Ψη))              (1)
p(η|z) = N(Wz, σ^2 I_{d−1})       (2)
p(z)   = N(0, I_k)                (3)

Here W ∈ R^{(d−1)×k} represents the PCA loading matrix, σ^2 is the variance, Ψ ∈ R^{d×(d−1)} is a fixed contrast matrix whose columns sum to zero and φ is the softmax transform. For a single sample, x ∈ N^d are the observed d-dimensional counts, η ∈ R^{d−1} are the latent logits and z ∈ R^k is the latent representation. The term φ(Ψη) is distributed logistic normal, φ(Ψη) ∼ LN(Wz, σ^2 I), as shown by Aitchison (37). Furthermore, p(x|z) yields a multinomial logistic normal distribution, given by marginalizing out η:

MLN(x|z) = ∫ p(x|η) p(η|z) dη

This integral is not tractable; as a result, this distribution does not have an analytically defined expectation, variance or probability density function. There have been multiple attempts to estimate the posterior distribution with MCMC (38; 39; 35; 40), but the complexity of this distribution requires a large number of samples, limiting the scalability of these methods. Variational methods have been developed to estimate the logistic normal distribution, but due to conditional non-conjugacy, these methods often rely on approximations to the ELBO, further complicating estimation (41). Recently, Silverman et al. (33) proposed using a Laplace approximation to estimate the parameters of the multinomial logistic normal posterior distribution. This approach relies on a two-stage optimization procedure on the factorized posterior distribution

p(η, z|x) ∝ p(η|x) p(z|η)         (4)

If η can be directly estimated, then conditional independence can be used to factorize the posterior distribution. Since the log probability densities of the multinomial distribution and the normal distribution are both concave functions, a global optimum can be obtained for the multinomial logistic normal distribution (33) (Appendix B.3).
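For concreteness, the generative model above can be simulated directly. The sketch below is a minimal NumPy version in which the contrast matrix Ψ is built by orthogonalizing against the constant vector via a QR factorization (an illustrative choice; any orthonormal basis whose columns sum to zero would do):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, depth = 8, 2, 500   # features, latent dimensions, sequencing depth

# An illustrative orthonormal contrast matrix Psi whose columns sum to zero
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]                                   # d x (d-1)

softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# z ~ N(0, I_k); eta ~ N(W z, sigma^2 I_{d-1}); x ~ Mult(softmax(Psi eta))
W = rng.normal(size=(d - 1, k))
sigma = 0.1
z = rng.normal(size=k)
eta = W @ z + sigma * rng.normal(size=d - 1)
x = rng.multinomial(depth, softmax(Psi @ eta))
print(x.sum())   # equals the sequencing depth
```

Because the softmax is applied to Ψη, the d multinomial parameters are driven by only d − 1 unconstrained logits, which is exactly the identifiability device discussed in the next section.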
Furthermore, the multinomial distribution does not introduce additional local optima for estimating Multinomial PCA. Given this, in addition to recent evidence that a PPCA MAP estimator can be obtained from regularized linear autoencoders (17) , we can design a new algorithm to obtain a Multinomial PPCA MAP estimator.

4.2. THE ILR TRANSFORM ENFORCES IDENTIFIABILITY

The softmax function is shift-invariant, which introduces an identifiability issue that has been addressed by the compositional data analysis community (1; 42). In order to remove this identifiability issue, an isomorphism between the logits η and the multinomial parameters must be maintained. One commonly used solution is a degenerate softmax, also known as the inverse additive log-ratio (ALR) transform (1) (Appendix B.2). Previous work has suggested that the isometric log-ratio (ILR) transform (21; 22) is more suitable for principal components analysis (Appendix B). The ILR transform and its inverse are given by

ILR(x) = Ψ^T log x,    ILR^{-1}(x) = φ(Ψx)      (5)

where Ψ ∈ R^{d×(d−1)} is a basis such that Ψ^T Ψ = I_{d−1} and ΨΨ^T = I_d − (1/d) 1_{d×d}. A naive implementation of the ILR transform can be both memory-intensive and computationally intensive for large d. However, any orthonormal basis can be used to parameterize the ILR transform, and some of these bases can be represented by binary trees (43; 44; 45). A binary tree can be used to represent Ψ with O(d log d) non-zero elements, where the lth column of Ψ is given by

Ψ_{·l} = ( 0, …, 0, a_l, …, a_l, −b_l, …, −b_l, 0, …, 0 )      (6)
a_l = sqrt( |s| / (|r|(|r| + |s|)) ),    b_l = sqrt( |r| / (|s|(|r| + |s|)) )

where l indexes an internal node of the tree with left children r (which receive the value a_l), right children s (which receive −b_l), k leading zeros for the tips to the left of l and t trailing zeros for the tips to the right (46) (Figure S2). Due to rotation invariance, it does not matter which tree is used to parameterize the ILR basis, but the choice of tree can influence the runtime of the ILR transform. If a balanced binary tree is used, the memory requirement for representing Ψ drops from O(d^2) to O(d log d), and the matrix-vector multiplication runtime is likewise reduced from O(d^2) to O(d log d) (see Appendix B.1). This can speed up the matrix-vector multiplication operations by an order of magnitude for datasets with more than ten thousand dimensions.
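The tree-structured construction above can be sketched in a few lines of NumPy. Here `ilr_basis` is an illustrative helper (not code from the paper) that assumes a balanced sequential binary partition, recursively halving the feature indices:

```python
import numpy as np

def ilr_basis(d):
    """Orthonormal ILR basis from a balanced sequential binary partition.
    Each column encodes one internal node: +a on the left block, -b on the right."""
    cols = []
    def split(idx):
        if len(idx) < 2:
            return
        mid = len(idx) // 2
        left, right = idx[:mid], idx[mid:]
        r, s = len(left), len(right)
        col = np.zeros(d)
        col[left] = np.sqrt(s / (r * (r + s)))
        col[right] = -np.sqrt(r / (s * (r + s)))
        cols.append(col)
        split(left)
        split(right)
    split(np.arange(d))
    return np.column_stack(cols)      # d x (d-1)

Psi = ilr_basis(16)
# Orthonormal columns, each summing to zero
print(np.allclose(Psi.T @ Psi, np.eye(15)))   # True
# A balanced tree keeps Psi sparse: d * log2(d) non-zeros instead of d * (d-1)
print(np.count_nonzero(Psi))                  # 64 for d = 16
```

Each of the d leaves appears in exactly one column per ancestor, so a balanced tree of height log2(d) yields the O(d log d) storage claimed above.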

4.3. MULTINOMIAL VARIATIONAL AUTOENCODER ARCHITECTURE

Our full Multinomial VAE model is given as follows:

p(x|η) = Mult(φ(Ψη))                                  (8)
p(η|z; θ_dec) = N(Wz + µ, σ^2 I_{d−1})                (9)
q(z|x; θ_enc) = N( F_L(Ψ^T log1p(x) − µ), D )         (10)

where θ_dec = {W, σ^2} denotes the decoder parameters, θ_enc = {F_L, D} denotes the encoder parameters and µ ∈ R^{d−1} is a bias parameter. Here, q(z|x; θ_enc) denotes the variational posterior distribution of z given by the encoder F_L, represented as an L-layer dense neural network with appropriate activations. This encoder is directly used to evaluate p(η|z; θ_dec). Furthermore, flat priors are assumed for all variables except z. It is important to note potentially challenging modeling issues when designing the encoder. The ILR transform is not directly applicable to count data, since log(0) is undefined. A common approach to this problem is to introduce a pseudocount before applying the logarithm, which we denote log1p(x) = log(x + 1). The choice of pseudocount is arbitrary and can introduce biases. To alleviate this issue, we introduce the deep encoder neural network highlighted in Equation 10; we expect the universal approximation theorem to apply here (47; 48), so the accuracy of the estimated latent representation z should improve with more complex neural networks. This is supported by our simulation benchmarks: more complex encoder architectures can better remove the biases induced by the pseudocounts.
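To make the architecture concrete, the following sketch runs one encode/decode pass on a count vector. All parameters are random and untrained, and the two-layer softplus encoder and QR-based basis are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, hidden = 16, 3, 32

# An illustrative orthonormal contrast basis (any ILR basis works here)
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]                                        # d x (d-1)

softplus = lambda a: np.log1p(np.exp(a))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# Randomly initialized (untrained) parameters
mu = np.zeros(d - 1)
W1 = 0.1 * rng.normal(size=(hidden, d - 1))
W2 = 0.1 * rng.normal(size=(k, hidden))
W_dec = rng.normal(size=(d - 1, k))

def encode(x):
    """Mean of q(z|x): an L = 2 layer encoder on pseudocounted ILR coordinates."""
    h = softplus(W1 @ (Psi.T @ np.log1p(x) - mu))     # log1p sidesteps log(0)
    return W2 @ h

def decode(z):
    """Mean multinomial probabilities: softmax(Psi (W z + mu))."""
    return softmax(Psi @ (W_dec @ z + mu))

x = rng.multinomial(500, np.ones(d) / d)
p_hat = decode(encode(x))
print(abs(p_hat.sum() - 1.0) < 1e-8)   # True: a valid probability vector
```

Note the asymmetry the text describes: the encoder is a deep network while the decoder remains linear in z, so the PPCA interpretation of W is preserved.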

4.4. ALTERNATING MINI-BATCH OPTIMIZATION PROCEDURE

Given that our objective here is to obtain the MAP estimate of the VAE model parameters, the VAE parameters θ = {F_L, W, D, σ^2} can be obtained by estimating the global maximum of the posterior distribution. In the original CU sampler implementation, the parameters (η_1, …, η_n) are optimized across the entire dataset and then fixed in order to estimate the remaining parameters. For large studies, this can be memory-demanding, since (η_1, …, η_n) ∈ R^{n×(d−1)} alone can scale to millions of parameters. To scale this estimation procedure to large high-dimensional datasets, we have devised a mini-batched alternating maximization procedure.

Algorithm 1: VAE Alternating Maximization
repeat
    for X^(i) ∈ X do
        Ĥ ← arg max_H log p(X^(i) | H),  Ĥ ∈ R^{b×(d−1)}
        θ ← arg max_θ log q(Ĥ | X^(i); θ) + log p(Z^(i))
    end for
until convergence

For a mini-batch X^(i) = (x^(1), …, x^(b)) of size b with corresponding latent variables Z^(i) = (z^(1), …, z^(b)) and H^(i) = (η^(1), …, η^(b)), the quantities log p(X^(i)|H^(i)) and log p(Z^(i)) are given by Equations 1 and 3. The variational posterior distribution of η, given by q(H^(i)|X^(i); θ) = ∏_{j=1}^{b} q(η^(j)|x^(j); θ), can be obtained by marginalizing out z in Equations 8, 9 and 10 as follows:

q(η^(j)|x^(j); θ) = N( W F_L(Ψ^T log1p(x^(j)) − µ) + µ, W D W^T + σ^2 I_{d−1} )      (11)

The prior log p(Z^(i)) is evaluated as log p(Z^(i)) = log N(Z^(i)|0, I_k), with the latent encoding means given by Z^(i) = F_L(Ψ^T log1p(X^(i)) − µ), where Z^(i) is integrated out of Equation 11.
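The inner arg max over H in Algorithm 1 can be sketched for a single sample with plain gradient ascent. The gradient Ψ^T(x − m·φ(Ψη)) follows from differentiating the multinomial log-likelihood; the step size and iteration count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 12
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]                                    # an illustrative orthonormal ILR basis
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# One H-step for a single sample: maximize log p(x | eta) over eta
x = rng.multinomial(2000, np.ones(d) / d).astype(float)
m = x.sum()
eta = Psi.T @ np.log1p(x)                         # the suggested initialization H_0
for _ in range(500):
    grad = Psi.T @ (x - m * softmax(Psi @ eta))   # gradient of the multinomial log-likelihood
    eta += 0.01 * grad / m                        # small, scaled ascent step
p_hat = softmax(Psi @ eta)
print(np.abs(p_hat - x / m).max() < 0.01)         # True: matches the empirical proportions
```

Because the objective is concave in η (Appendix B.3) and the log1p initialization already sits near the optimum, very few ascent steps are needed in practice, which is exactly the point of the warm-start discussed in Appendix B.4.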

5.1. MULTINOMIAL PPCA AND VAES AGREE IN SIMULATION BENCHMARKS

To determine whether the proposed multinomial variational autoencoder can recover principal components, we extended the benchmarks proposed in (18). Here, we benchmarked our proposed analytical VAE against stochastic VAEs (9) with a multinomial likelihood across multiple simulations. These methodologies were compared against the ground truth covariances of the multinomial logistic normal, in addition to the MAP and Hamiltonian Monte Carlo (HMC) estimates obtained from Stan (49). The agreement between the ground truth principal components and the estimated principal components is measured by axis alignment, subspace distance, correlation and Procrustes (50) (see Appendix C.4 for details). When fitting against multinomial logistic normal distributed data with no zero counts, all of the ILR-based methodologies can accurately estimate the covariance matrix up to scale (Figure 1a-d). The analytical VAE MAP estimate and the posterior samples of HMC all have a correlation close to 1, suggesting strong agreement between the ground truth covariance and the estimated covariances. If the principal components were perfectly estimated, the axis-alignment, subspace distance and Procrustes metrics would all be close to zero. While the subspace distance and Procrustes metrics are notably close to zero, the axis-alignment metric is above 0.5, which would suggest disagreement between the ground-truth and the estimated principal component axes. Given that the posterior distribution of the axis-alignment metric obtained from HMC overlaps with the analytical VAE, stochastic VAE and MAP estimates, there is evidence that reducing this metric to zero is inherently difficult for count data. However, both the subspace distance and Procrustes metrics approach zero, supporting our claim that principal components can be estimated up to scale and rotation on fully observed count data.
Furthermore, our simulations (Figure 1i-k and Figure 2a-h) suggest that among the log-ratio transforms benchmarked, only the ILR transform can recover principal components, corroborating previous findings surrounding compositional PCA (42; 23). Dealing with sparse data presents additional challenges; the zeros in count data are indicative of missing data, which can further complicate the estimation of principal components. Our hypothesis that boosting the complexity of the encoder architecture would help alleviate issues with missing data is supported by the benchmarks shown in Figure 2i-p. In both of the sparse datasets, none of the methods were able to achieve optimal accuracy across any of the benchmark metrics, but there is a clear advantage to utilizing multilayer encoders compared to single-layer encoders. Across the simulation benchmarks, the analytical and stochastic VAEs have comparable performance, with discrepancies highlighted by the correlation metric in the dense and sparse benchmarks. On the dense datasets, our proposed analytical VAE has better agreement with the ground truth covariance, whereas on the sparse datasets, the stochastic VAE appears to have better agreement. It is difficult to explain the performance gap between our proposed analytical VAE and the stochastic VAE due to the analytical intractability of the Multinomial VAE ELBO. The challenge of accurately estimating an optimal H for each mini-batch could be one limiting factor affecting the performance of our proposed analytical VAE (Appendix B.4).

(Figure caption fragment: Log-ratio transform benchmarks investigate different log-ratio transforms using a single-layer encoder. The Encoder Depth benchmark investigates the impact of the number of encoder layers combined with the ILR transform.)

5.2. PRETRAINING MULTINOMIAL VAES ON A VERY LARGE COMPENDIUM OF MOUSE MICROBIOME OBSERVATIONS

To showcase this on real microbiome datasets, 11k mouse fecal pellet samples with a total of 8.8k features across 45 studies were analyzed. We trained two models to evaluate the differences between the stochastic VAE and the analytical VAE. We trained these models for 10k epochs; the models with the smallest validation error are reported here. The details of the full training procedure are given in Appendix C.3. Visualization of the learned decoder weights shown in Figure 3 makes it clear that the analytical VAE and, to a lesser extent, the stochastic VAE are able to learn orthogonal embeddings, since W^T W approximates a diagonal matrix. The orthogonality of the decoder weights is more apparent in our proposed analytical VAE than in the stochastic VAE decoder; there is a larger distinction between the diagonal elements and the off-diagonal elements. Unlike linear VAEs (8), the connection between the eigenvalues and the decoder weights is not as clear due to the asymmetry between the encoder and the decoder; scale identifiability of the decoder weights complicates the process of estimating the eigenvalues. This is apparent in Figure 3; the heatmaps of the W inner products are on vastly different scales. One of the advantages of employing a pretraining strategy with VAEs is that the pretrained models can enable one-shot learning (51). By estimating a low-dimensional representation from a large unlabeled high-dimensional dataset, fewer labeled samples are needed to train downstream classifiers. We showcase this property in two classification benchmarks.

5.3. CLASSIFICATION AND CORRELATION BENCHMARKS

To construct benchmarks on real biological datasets, we analyzed 16S sequencing data obtained from two mouse studies (52; 53). The learned decoder weights can be related to Lovell's proportionality metric through

‖(ΨW)_i − (ΨW)_j‖^2 ∝ Var( log(x_i / x_j) )

where x refers to the observed proportions and i and j refer to the two features being compared. The proportionality metric has a simple interpretation: if this variance is small, the two features are highly co-occurring. Here, only a relative relationship between the proportionality metric and the VAE decoder row distances can be stated, due to scale identifiability issues inherent in Multinomial PPCA. Proportionality has been shown to be a more reliable metric for compositional data than standard correlation metrics such as Pearson and Spearman in single-cell RNA sequencing and microbiome studies (3; 55; 56; 2), supporting the notion that comparing our VAE decoder to Lovell's un-normalized proportionality metric provides a sound perspective, as well as a means for interpreting the weights of the VAE. We compared the estimated VAE embeddings with the proportionality metric to determine if this relationship holds empirically. The pairwise decoder distances are compared to the pairwise proportionality metrics of 300 selected features (Table 1 and Figure S5). The learned VAE representations are benchmarked using K-nearest neighbors (KNN) classification. KNN is applied to the learned VAE encodings across the two microbiome datasets to determine how well the classifiers can identify the mice that were induced with hypoxia (52) and classify mice based on their experimental group (53). These classification models are compared against two baseline models, namely KNN trained on LDA topic proportions (31) and KNN trained on raw counts. The classification benchmarks illustrated in Table 1 show a large margin between the VAE models and the baseline models across the measured metrics.
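Under the decoder model, the proportionality relation above can be verified numerically. The sketch below (NumPy, noise-free decoder samples, illustrative dimensions; not the paper's benchmark code) compares the squared decoder-row distance against the sampled log-ratio variance:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = 10, 3, 200_000
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]
W = rng.normal(size=(d - 1, k))

# Noise-free decoder samples: log x = Psi W z up to a shared normalizing
# constant, which cancels inside any log-ratio
Z = rng.normal(size=(n, k))
logx = Z @ (Psi @ W).T

i, j = 0, 1
lhs = np.sum(((Psi @ W)[i] - (Psi @ W)[j]) ** 2)   # squared decoder-row distance
rhs = np.var(logx[:, i] - logx[:, j])              # Lovell's (un-normalized) proportionality
print(np.isclose(lhs, rhs, rtol=0.05))             # True: the two quantities agree
```

With the decoder's σ^2 noise and the scale indeterminacy of W included, only the proportional (rather than exact) relationship stated in the text survives.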
Both VAEs had a higher F1 score and AUC, suggesting that the learned representations can better separate the experimental groups. Part of this performance boost could be attributed to the differences between distance metrics. When the ILR transform is utilized, the Euclidean distance between the latent encodings approximates the Aitchison distance between proportions (57; 23), which measures relative differences instead of absolute differences. Our findings corroborate theoretical insights from compositional data analysis and empirical microbiome observations, where the Aitchison distance provided a substantial boost in KNN classification accuracy compared to other distance metrics (58; 59). Furthermore, the average negative log-likelihood (NLL) is lower for both of the VAE models compared to LDA, suggesting that the VAE models generalized better on held-out data than LDA. This decrease in the predictive log-likelihood could be attributed to the increased flexibility of the covariance matrix in the logistic normal distribution compared to the Dirichlet distribution. There are a couple of notable discrepancies between the two Multinomial VAE models. In the classification benchmarks, the stochastic VAE generally performs slightly better compared to the analytical VAE. The exception to this is the HCC dataset (53), where the analytical VAE marginally outperforms the stochastic VAE in terms of AUC. However, the analytical VAE can learn more apparent orthogonal embeddings (Figure 3) and better agrees with Lovell's proportionality metric (Table 1). Only the analytical VAE was able to recover the log-linear relations between Lovell's proportionality and the VAE embedding distances, suggesting that it can more accurately learn biologically relevant correlations (Figure S5).

CONCLUSION

Prior work aiming to build frameworks for the probabilistic estimation of covariance matrices from count data has been largely limited to conjugate priors due to their tractability. However, these choices can lead to models with lower explanatory power due to the rigid structure of the resulting covariances. Further, the correct treatment of compositional data requires additional development in this context. For example, disregarding the simplicial sample space associated with compositional data and computing Pearson correlations on raw count data is common practice in scientific applications, but is a source of reproducibility issues (2). Due to the negative bias that compositionality induces in covariances computed on the observed count data, the estimated covariances will not agree with the ground truth covariances in the system of interest, a fact noted by Pearson in 1897 (60). Adopting the logistic normal distribution in place of conventional conjugate priors provides a means to remedy the issue of inferring correlations on compositional data. The covariance matrix on ILR-transformed data can be interpreted using Lovell's proportionality metric and can serve as a replacement for pairwise correlations. Since log-ratios are scale-invariant, the dependence on the total counts disappears, which is critical for ensuring agreement between the relative quantities measured through count data and the absolute quantities in the system that are not directly observable. While there is sound motivation to employ the logistic normal distribution in this context, its application has been limited due to the challenge of estimating these distributions. Here, we have provided a means to accurately and efficiently estimate these distributions and covariance matrices up to scale using the ILR transform. To this end, we have provided a proof of concept that Multinomial VAEs can learn the principal components obtained from Multinomial PPCA.
We have shown that fitting low-rank approximations using Multinomial VAEs can provide more explanatory representations than LDA while providing a means to obtain interpretable correlations. These methods can be used in one-shot learning and transfer learning settings, requiring fewer labeled samples for training and allowing for the use of pretrained models to extract features from smaller datasets. Given the vast number of scientific disciplines that collect compositional data, we anticipate that models such as these Multinomial VAEs will have a significant impact on many scientific applications.

A CHALLENGES IN DERIVATION OF AN ANALYTICAL MULTINOMIAL VAE ELBO

Recall that the generative model for Multinomial PPCA is given as follows:

p(x|η) = Mult(x | φ(Ψη))
p(η|z) = N(η | Wz + µ, σ^2 I_{d−1})      (14)
p(z)   = N(z | 0, I_k)

With this in mind, we wish to estimate variational distributions q(η, z|x) = q(η|z) q(z|x) to approximate the posterior p(η, z|x). These variational distributions can both be chosen to be normal distributions as follows:

q(z|x) = N( V(Ψ^T log1p(x) − µ), D ),    q(η|z) = p(η|z)

Noting that z ∼ N(V(Ψ^T log1p(x) − µ), D), q(η|x) can be derived from q(z|x) as follows:

q(η|x) = N( W V(Ψ^T log1p(x) − µ) + µ, W D W^T + σ^2 I )

To fine-tune these variational distributions to approximate the posterior distribution, we can minimize the KL divergence

arg min_{q(η,z|x)} KL( q(η, z|x) ‖ p(η, z|x) )

Since we cannot optimize this quantity directly, we opt instead to maximize the evidence lower bound (ELBO):

log p(x) ≥ E_{q(η,z|x)} [ log ( p(x|η) p(η|z) p(z) / q(η, z|x) ) ]

We can partition this lower bound into three parts:

= E_{q(η|x) q(z|x)} [ log p(x|η) ]                  (i)
+ E_{q(η,z|x)} [ log ( p(η|z) / q(η|z) ) ]          (ii)
+ E_{q(z|x)} [ log ( p(z) / q(z|x) ) ]              (iii)

Since log Mult(x|p) ∝ Σ_{i=1}^d x_i log p_i, the first term is given by

E_{q(η|x) q(z|x)} [ log p(x|η) ] ∝ E_{q(η|x)} [ Σ_{i=1}^d x_i log φ(Ψη)_i ]
= ∫ N( η | W V(Ψ^T log1p(x) − µ) + µ, W D W^T + σ^2 I ) Σ_{i=1}^d x_i log φ(Ψη)_i dη

Estimating the above integral is equivalent to estimating the expectation of a logistic normal distribution, which does not have an analytical solution (37). As a result, the analytical ELBO for the Multinomial VAE is intractable.
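In practice this intractable expectation is approximated by Monte Carlo sampling, which is what the reparameterization trick in stochastic VAEs amounts to. A minimal NumPy sketch with an arbitrary diagonal Gaussian standing in for q(η|x) (all parameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, S = 6, 50_000
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]

# An arbitrary diagonal Gaussian standing in for q(eta|x), plus observed counts
mean, std = rng.normal(size=d - 1), 0.3
x = rng.multinomial(100, np.ones(d) / d).astype(float)

etas = mean + std * rng.normal(size=(S, d - 1))    # samples eta ~ q(eta|x)
V = etas @ Psi.T                                   # logits Psi eta for every sample
log_phi = V - V.max(axis=1, keepdims=True)
log_phi -= np.log(np.exp(log_phi).sum(axis=1, keepdims=True))   # row-wise log-softmax
mc = (log_phi @ x).mean()    # MC estimate of E_q[ sum_i x_i log phi(Psi eta)_i ]
print(mc < 0)                # True: it is an expected log-probability term
```

The estimator is unbiased but noisy, which is one source of the analytical/stochastic VAE discrepancies discussed in the benchmarks.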

B THE ILR TRANSFORM

As outlined in Equation 5, any orthonormal basis Ψ can be used to perform the ILR transform. However, there are a select few bases that can be represented by a binary tree. To see how such a basis can be constructed from a binary tree, consider the illustration in Figure S1.

Figure S1: A small example showing how an orthogonal basis can be constructed from a binary tree. x = (x_1, x_2, x_3, x_4) ∈ S^4 represents species proportions and η = (η_1, η_2, η_3) ∈ R^3 represents the log-ratios at the internal nodes. Ψ is a matrix of orthogonal contrasts that can also be represented by a binary tree.

Here, the rows of the matrix in Figure S1 are orthogonal, and the resulting product η = Ψ^T log x yields

η_1 = log( x_1 / g(x_2, x_3, x_4) ),    η_2 = log( x_2 / g(x_3, x_4) ),    η_3 = log( x_3 / x_4 )

where g(·) denotes the geometric mean. The contrast matrix Ψ can be forced to be orthonormal such that Ψ^T Ψ = I_{d−1}, as highlighted in Equation 6. Furthermore, this construction can be scaled to large binary trees, as shown in Figure S2. Here, η_l represents the log-ratio at the internal node l, given by

η_l = sqrt( |r||s| / (|r| + |s|) ) log( g(x_r) / g(x_s) )      (16)

The runtime of this operation is further discussed in Appendix B.1.
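One consistent reading of the Figure S1 example can be checked numerically. The contrast rows below are reconstructed for illustration from the ladder tree (an assumption about the figure, using geometric means g(·)); they are mutually orthogonal and reproduce the stated log-ratios:

```python
import numpy as np

# Ladder-tree contrasts over (x1, x2, x3, x4); rows reconstructed for illustration
Psi_T = np.array([
    [1.0, -1/3, -1/3, -1/3],   # x1 vs g(x2, x3, x4)
    [0.0,  1.0, -1/2, -1/2],   # x2 vs g(x3, x4)
    [0.0,  0.0,  1.0, -1.0],   # x3 vs x4
])

x = np.array([0.1, 0.2, 0.3, 0.4])
eta = Psi_T @ np.log(x)
gmean = lambda v: np.exp(np.log(v).mean())

print(np.isclose(eta[0], np.log(x[0] / gmean(x[1:]))))   # True
print(np.isclose(eta[1], np.log(x[1] / gmean(x[2:]))))   # True
print(np.isclose(eta[2], np.log(x[2] / x[3])))           # True

# Rows are orthogonal; rescaling them as in Equation 6 would make them orthonormal
G = Psi_T @ Psi_T.T
print(np.allclose(G, np.diag(np.diag(G))))               # True
```

Normalizing each row by its Euclidean length recovers exactly the coefficients a_l and b_l of Equation 6.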

B.1 THE RUNTIME OF THE ILR TRANSFORM

The naive runtime of the ILR transform of a single sample x ∈ R^d is O(d^2), due to the running time of dense matrix-vector multiplication. As shown in Equation 5, the ILR transform can be represented by a log-linear transformation with a contrast matrix, and a binary tree can be used to represent this contrast matrix, as discussed in (46). If the binary tree is balanced, each row of Ψ will have O(log d) non-zero elements, since the tree has a height of O(log d). Given that there are d rows, the matrix-vector multiplication behind the inverse ILR transform Ψη can be done in O(d log d). For the same reason, the ILR transform itself has a runtime of O(d log d).

Figure S2: An illustration of how the ILR basis can be constructed on large trees. The quantities g(x_r) and g(x_s) are the geometric means of the proportion vector x over the subsets x_r and x_s. Here, r and s refer to the sets of features in the left and right subtrees of the internal node l. The log-ratios η can be obtained from either Equation 6 or Equation 16.
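This sparsity can be exploited directly with sparse matrix storage. The sketch below uses SciPy's CSR format; `balanced_ilr_sparse` is an illustrative helper (assuming a perfectly balanced partition, not code from the paper):

```python
import numpy as np
from scipy import sparse

def balanced_ilr_sparse(d):
    """Balanced sequential-binary-partition ILR basis stored as a CSR matrix."""
    rows, cols, vals = [], [], []
    col, stack = 0, [np.arange(d)]
    while stack:
        idx = stack.pop()
        if len(idx) < 2:
            continue
        mid = len(idx) // 2
        left, right = idx[:mid], idx[mid:]
        r, s = len(left), len(right)
        rows += list(left) + list(right)
        cols += [col] * (r + s)
        vals += [np.sqrt(s / (r * (r + s)))] * r + [-np.sqrt(r / (s * (r + s)))] * s
        col += 1
        stack += [left, right]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(d, d - 1))

d = 1024
Psi = balanced_ilr_sparse(d)
print(Psi.nnz)               # 10240 = d * log2(d), instead of d * (d - 1) for dense
eta = np.random.default_rng(6).normal(size=d - 1)
v = Psi @ eta                # inverse-ILR matvec touches only the stored entries
print(abs(v.sum()) < 1e-8)   # True: each column sums to zero, so v is centered
```

The same CSR matrix serves both directions: `Psi @ eta` for the inverse ILR and `Psi.T @ logx` for the forward transform, each in O(d log d).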

B.2 SHIFT INVARIANCE OF THE SOFTMAX TRANSFORM

There are two different scale identifiability issues. The commonly used softmax transform is shift-invariant:

φ(x) = φ(x + a·1)    ∀ a ∈ R

This shift invariance causes an identifiability issue when identifying the decoder matrix W. The inverse ALR transform resolves this issue by enforcing an isomorphism; the last coordinate is pinned to zero as follows:

ALR(x) = ( log(x_1 / x_d), …, log(x_{d−1} / x_d) ),    ALR^{-1}(y) = φ( (y_1, …, y_{d−1}, 0) )

One implicit consequence of this construction is that the resulting contrast matrix is not orthogonal (42). Furthermore, because the ALR transform is not isometric, the resulting Euclidean distances in z will not approximate the Aitchison distance. The ILR transform provides the best of both worlds, enforcing both isometry and isomorphism. Furthermore, there is a tight connection between the ILR transform and compositional PCA that is discussed further in the next section.
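The isometry claim can be checked directly. In the sketch below (NumPy, random compositions, QR-based ILR basis as an illustrative choice), ILR distances reproduce the Aitchison distance while ALR distances do not:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5
Q, _ = np.linalg.qr(np.column_stack([np.ones(d), rng.normal(size=(d, d - 1))]))
Psi = Q[:, 1:]

clr = lambda x: np.log(x) - np.log(x).mean()      # centered log-ratio
ilr = lambda x: Psi.T @ np.log(x)
alr = lambda x: np.log(x[:-1] / x[-1])            # additive log-ratio

x, y = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
aitchison = np.linalg.norm(clr(x) - clr(y))

print(np.isclose(np.linalg.norm(ilr(x) - ilr(y)), aitchison))   # True: ILR is an isometry
print(np.isclose(np.linalg.norm(alr(x) - alr(y)), aitchison))   # generally False for ALR
```

The identity follows from ΨΨ^T being the centering projector: ‖Ψ^T v‖^2 = v^T ΨΨ^T v = ‖clr-difference‖^2 for any v = log x − log y.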

B.3 GLOBAL LOG-CONCAVITY OF THE MULTINOMIAL LOGISTIC NORMAL DISTRIBUTION

Since the multinomial logistic normal distribution is difficult to evaluate directly, it is challenging to make statements about its maximum likelihood estimator. With the posterior factorization highlighted in Equation 4, we can make concrete statements about the individual posterior factors. The log probability density of the multinomial distribution p(x|η) can be written as

log p(x|η) ∝ Σ_{i=1}^{d} x_i log φ(Ψη)_i = Σ_{i=1}^{d} x_i (Ψη)_i − m log Σ_{j=1}^{d} exp(Ψη)_j = g(η) · T(x) − A(η)

where m = Σ_{i=1}^{d} x_i is the total number of counts, and g(η)_i = (Ψη)_i, T(x)_i = x_i and A(η) = m log Σ_{j=1}^{d} exp(Ψη)_j are the natural parameters, sufficient statistics and log-partition function of the exponential family representation. The Hessian of log p(x|η) is given by

∂² log p(x|η) / ∂η_i ∂η_j = − ∂² A(η) / ∂η_i ∂η_j

Since A(η) is strictly convex, log p(x|η) is strictly concave. Similarly, the log probability density of the multivariate Gaussian distribution p(z|η) = N(µ, Σ) is also log-concave. Since the log density of the multinomial logistic normal distribution can be written as the sum of log p(z|η) and log p(x|η), it is also concave. Therefore, there is a unique optimal estimate for µ and Σ.
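The sign of the Hessian can be checked numerically. The sketch below is our own illustration (the orthonormal Ψ here is a random stand-in, not a tree-derived contrast matrix): it forms the Hessian of the log-partition A(η), which works out to m Ψ^T (diag(p) − p p^T) Ψ, and confirms it is positive semidefinite, so the multinomial log-likelihood Hessian −∇²A(η) is negative semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 100
# Random orthonormal stand-in for the contrast matrix (d x (d-1)).
psi = np.linalg.qr(rng.standard_normal((d, d)))[0][:, : d - 1]
eta = rng.standard_normal(d - 1)

y = psi @ eta
p = np.exp(y - y.max()); p /= p.sum()
# Hessian of the log-partition A(eta) = m * logsumexp(psi @ eta):
H_A = m * psi.T @ (np.diag(p) - np.outer(p, p)) @ psi
# The multinomial log-likelihood Hessian is -H_A, so A's Hessian being
# positive semidefinite implies the log-likelihood is concave in eta.
print(np.all(np.linalg.eigvalsh(H_A) >= -1e-9))  # True
```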

B.4 MULTINOMIAL VARIATIONAL AUTOENCODER ESTIMATION

The covariance matrix W D W^T + σ² I_{d-1} ∈ R^{(d-1)×(d-1)} in Equation 11 can be efficiently inverted using the Woodbury identity (67; 68). The prior log p(Z^(i)) is evaluated as log p(Z^(i)) = N(Z^(i) | 0, I_k), where Z^(i) = F_L( Ψ^T( log(X^(i)) − µ ) ). The optimal Ĥ^(i) that maximizes log p(X^(i) | H^(i)) is obtained through gradient descent. Once Ĥ^(i) is obtained, the remaining VAE parameters θ can be estimated, also via gradient descent. Like the EM algorithm proposed in (69), this procedure alternates between estimating H^(i) and θ and repeats until convergence. For a single-layer linear encoder, this optimization procedure will eventually reach the global maximum with respect to H^(i) and the multivariate normal mean and covariance in q(H^(i) | θ, X^(i)), due to the log-concavity of the multinomial and normal distributions (Appendix B.3). On fully observed data, W and D can be estimated up to rotation and scale (Appendix B.6). This is not guaranteed for sparse count data, but our simulations empirically show that increasing the complexity of the encoder architecture can help. Since Ĥ^(i) needs to be accurately estimated before optimizing F_L, W, σ and D, multiple gradient descent steps are required for each mini-batch. In practice, obtaining the optimal Ĥ^(i) for a given mini-batch i may require hundreds of gradient descent steps from a random initialization. Since H^(i) approximates the multinomial parameters describing the observed counts x, we can instead initialize H^(i) with Ĥ^(i)_0 = Ψ^T log(X^(i)). In practice, this greatly reduces the number of gradient descent updates needed per mini-batch.
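The Woodbury step can be sketched as follows. This is a minimal NumPy illustration under the stated shapes; `woodbury_inverse` is a hypothetical helper name, and a production implementation would use Cholesky factors and solves rather than explicit inverses.

```python
import numpy as np

def woodbury_inverse(W, D, sigma2):
    """Invert W @ D @ W.T + sigma2 * I via the Woodbury identity:
    (s I + W D W^T)^{-1} = I/s - W (s D^{-1} + W^T W)^{-1} W^T / s.
    Costs O(p k^2 + k^3) instead of O(p^3) for W in R^{p x k}."""
    p, k = W.shape
    inner = np.linalg.inv(sigma2 * np.linalg.inv(D) + W.T @ W)
    return np.eye(p) / sigma2 - (W @ inner @ W.T) / sigma2
```

Only the small k x k system is inverted, which is what makes the low-rank covariance tractable in high dimensions.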

B.5 THE CONNECTION BETWEEN MULTINOMIAL VAES AND COMPOSITIONAL PCA

With a singular value decomposition, the resulting factors can be used to approximate the row and column distances. For a singular value decomposition X = U S V^T, the row and column distances can be approximated as

‖x_{i·} − x_{j·}‖_2 ≈ ‖S u_{i·} − S u_{j·}‖_2,    ‖x_{·i} − x_{·j}‖_2 ≈ ‖S v_{i·} − S v_{j·}‖_2

where u_{i·} and v_{i·} denote the i-th rows of U and V. This relationship also yields a connection to the row and column covariances: the row covariances are given by X X^T = U S² U^T and the column covariances are given by X^T X = V S² V^T. A similar relationship applies to compositional PCA, except that the singular value decomposition is applied to the CLR-transformed values of X. The CLR transform and its inverse are given by

CLR(x) = ( log(x_1 / g(x)), ..., log(x_d / g(x)) ),    CLR^{-1}(x) = φ(x)

so the inverse CLR transform is simply the softmax transform φ(x). The Euclidean distance on CLR- and ILR-transformed values is the Aitchison distance (57). As a result, the row and column distances above are given by the Aitchison distance. The covariance matrix of CLR-transformed values is singular, and as a result the singular value decomposition will have rank at most d−1. As shown by Egozcue et al. (23), the top d−1 components of the singular value decomposition can all be represented in ILR coordinates. This observation cements the connection between compositional PCA and our proposed Multinomial VAE; since all of the singular value components can be represented in ILR coordinates, this theoretically justifies the use of the ILR transform within our proposed Multinomial VAE. Due to this connection, the distances between the learned VAE representations z across pairs of samples should be proportional to the Aitchison distance. Furthermore, the distances between rows of ΨW are also given by the Aitchison distance, which is equivalent to Lovell's un-normalized proportionality constant (23).
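A compositional PCA in this sense can be sketched in a few lines. This is our own minimal NumPy example, not the paper's code; it shows that the smallest singular value of centered CLR data vanishes, reflecting the rank-(d−1) constraint noted above.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of strictly positive compositions (rows)."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=20)   # 20 compositions in the 4-simplex
Z = clr(P)
Z = Z - Z.mean(axis=0)                   # column-center before the SVD
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
# CLR-transformed rows sum to zero, so the data live in a (d-1)-dimensional
# subspace and the smallest singular value is numerically zero.
print(S[-1] < 1e-10)  # True
```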

B.6 LACK OF SCALE IDENTIFIABILITY OF PPCA

As discussed in (8), VAEs have the same identifiability issues as PPCA; namely, for any invertible diagonal matrix A ∈ R^{k×k}, the substitutions W ← W A and V ← A^{-1} V leave the model unchanged. As a result, the decoder weights W can only be identified up to scale. Furthermore, the Multinomial VAE architecture is asymmetric and, as a result, the Transpose Theorem proposed by (17) is not readily applicable. This lack of symmetry between W and V complicates the process of obtaining eigenvalues from W.
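This equivalence is straightforward to verify numerically; the shapes below are arbitrary choices of ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))   # decoder weights
V = rng.standard_normal((3, 8))   # encoder weights
A = np.diag([2.0, 0.5, 3.0])      # any invertible diagonal rescaling
# (W A) and (A^{-1} V) compose to exactly the same linear map as (W, V),
# so the per-column scale of the decoder is not identifiable.
print(np.allclose((W @ A) @ (np.linalg.inv(A) @ V), W @ V))  # True
```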

C SIMULATION AND MICROBIOME BENCHMARKS C.1 SIMULATION DETAILS

In the dense simulation highlighted in Figure 1, there were 100 features, 1000 samples and 1 million counts per sample. These counts were drawn from a multinomial logistic normal distribution whose covariance matrix had a rank of 10. The sparse dataset highlighted in Figure 2a-h contained 200 dimensions and 1000 samples, each of which contained 100 counts that were sampled from a multinomial logistic normal distribution whose covariance matrix had a rank of 10. The sparse dataset highlighted in Figure 2i-p contained 5000 dimensions and 10000 samples, each of which contained only 1000 counts total. These counts were sampled from a multinomial logistic normal distribution whose covariance matrix had a rank of 128. The encoder architecture used for the stochastic and analytical VAEs consisted of fully connected dense layers with interwoven softplus activation functions. The intermediate encoder layers all have the same input and output dimension as the number of latent dimensions specified in the model. The same architecture was used for both the stochastic VAEs and the analytical VAEs. A schematic of the architecture is shown in Figure S3. Multinomial PCA was fitted using both penalized likelihood estimation and Hamiltonian Monte Carlo using Stan (Appendix C.4). These models were initialized with the factor loadings estimated by the analytical VAE, which is particularly critical for fitting HMC on high-dimensional datasets. We were not able to run Stan on the sparse datasets.
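For reference, the dense generative process described above can be sketched as follows. This is a hedged approximation of the simulation: the loadings scale is our own choice, and for brevity the softmax is applied with an ALR-style zero-padded coordinate where the paper's simulations use an ILR basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, depth = 100, 10, 1000, 10**6  # features, latent rank, samples, counts/sample

W = rng.standard_normal((d - 1, k)) / np.sqrt(k)  # rank-k loadings (illustrative scale)
sigma = 0.1
X = np.empty((n, d), dtype=np.int64)
for i in range(n):
    # Logistic normal parameters: low-rank latent signal plus isotropic noise.
    eta = W @ rng.standard_normal(k) + sigma * rng.standard_normal(d - 1)
    y = np.concatenate([eta, [0.0]])              # zero-padded reference coordinate
    p = np.exp(y - y.max()); p /= p.sum()         # softmax to the simplex
    X[i] = rng.multinomial(depth, p)              # observed counts
```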

C.2 EXPERIMENTAL DATASET DETAILS

The full 8.8k mouse pellet dataset was retrieved from Qiita (70) using redbiom (71). In the hypoxia study conducted by Tripathi et al., mice were placed in simulated conditions to induce intermittent hypoxic/hypercapnic (IHH) stress, where the mouse's oxygen supply was reduced and its CO2 supply was increased. The goal of this experiment was to detect gut microbiome differences between the IHH mice and the control mice. We use this information as a classification benchmark to determine whether the experimental conditions the mice were placed under can be predicted from 16S sequencing counts. In total, there were 48 mice; 24 of these mice were placed in the IHH conditions and the remaining 24 mice served as controls. Fecal samples were collected twice a week for 6 weeks, resulting in 579 samples total. There were 5775 microbial taxa detected in this dataset. The goal of the hepatocellular carcinoma (HCC) study conducted by Shalapour et al. was to investigate the interplay between immunity, diet and carcinoma. In total, there were 478 mice with 52 phenotypes, split according to diet, immunity status and whether or not the mice were induced with HCC. Our classification objective is to predict the mouse phenotype from the 16S sequencing counts obtained from their fecal pellets. One fecal pellet was collected from each mouse, and there were 2794 microbial taxa detected across these samples. This benchmark is more challenging than the IHH benchmark due to the extreme class imbalance; some phenotypes contain upwards of 30 mice while others contain as few as 2 (Figure S4). The resulting classification benchmarks are shown in Table 1.

Figure S3: Visualization of the VAE architecture.

C.3 MICROBIOME VAE TRAINING DETAILS

For the full 11k fecal pellet dataset, we focused on samples that were processed using closed-reference OTU picking with vsearch (72). This dataset was split into an 80/10/10 train/validation/test split in order to measure out-of-distribution generalizability. We trained our proposed analytical VAE and the stochastic VAE with 128 latent dimensions and a batch size of 1000 samples for 10000 epochs, with a learning rate of 10^-3 and 100 gradient descent steps per mini-batch. Checkpoints were recorded every epoch; the models with the best validation error are reported. Cosine annealing with warm restarts was used as the learning rate scheduler, with the intention of more easily escaping saddle points during optimization.
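For illustration, the within-restart learning-rate curve can be written down directly. This is a simplified fixed-period version of cosine annealing with warm restarts; `eta_max` matches the learning rate reported above, while `T0` is an arbitrary placeholder, not a reported hyperparameter.

```python
import math

def cosine_annealing_warm_restarts(t, eta_max=1e-3, eta_min=0.0, T0=10):
    """Learning rate at epoch t for a fixed restart period T0.

    A simplified sketch of the Loshchilov & Hutter schedule: the rate decays
    along a half-cosine from eta_max to eta_min, then jumps back to eta_max
    at each restart, which is what helps escape saddle points.
    """
    t_cur = t % T0
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T0))
```

In PyTorch the full schedule (including geometrically growing restart periods) is available as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.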

C.4 METRICS FOR EVALUATING ESTIMATED DECODER WEIGHTS

To measure the agreement between the ground truth simulations and the estimated decoder weights W, three different metrics were evaluated, namely axis alignment and subspace distance, as proposed in (18), in addition to Procrustes analysis (50). Given the decoder weights W and the ground truth principal components U, these metrics are defined as follows:

Axis alignment

d(U, W) = 1 − (1/k) Σ_{i=1}^{k} max_j ( (U_i^T W_j)² / ( ‖U_i‖_2² ‖W_j‖_2² ) )

This metric is ultimately a measure of the average cosine distance between each ground truth eigenvector and its best-matching column of the estimated decoder weights.
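The axis-alignment metric can be computed directly from the column-wise cosines. This is a minimal NumPy sketch of the formula above; `axis_alignment` is our own helper name.

```python
import numpy as np

def axis_alignment(U, W):
    """1 minus the mean, over ground-truth axes, of the best squared cosine
    similarity with any decoder column (0 = perfectly axis-aligned)."""
    Un = U / np.linalg.norm(U, axis=0, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    C2 = (Un.T @ Wn) ** 2          # squared cosines between all column pairs
    return 1.0 - C2.max(axis=1).mean()
```

A permuted and rescaled copy of U scores 0, reflecting that the metric ignores column order and scale.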



Ψ_l = ( 0, ..., 0, a, ..., a, b, ..., b, 0, ..., 0 )    (6)

where a is repeated |r| times over the features in the left subtree, b is repeated |s| times over the features in the right subtree, and the remaining entries are zero, with

a = √( |s| / ( |r| (|r| + |s|) ) ),    b = −√( |r| / ( |s| (|r| + |s|) ) )

Figure 1: Dense simulation benchmarks. (a-d) Comparison of ground truth covariances (y-axis) and estimates from the analytical VAE, stochastic VAE, MAP estimator and HMC performed in Stan (x-axes); all of these methods use the ILR transform, and only the posterior means are shown in (d). (e-h) Correlation, axis-alignment, subspace-distance and Procrustes metrics between analytical VAEs, stochastic VAEs, MAP and HMC after convergence. (i-k) Comparison of log, ALR and ILR transforms with respect to axis-alignment, subspace distance and Procrustes. The axis-alignment metric measures the angular differences between W and the ground truth principal components. The subspace distance is a measure of agreement between the ground truth correlations and the correlations estimated from W. Procrustes measures the residual error after obtaining the best rotation and scaling factors to match W to the ground truth principal components. See Appendix C.5 for descriptions of the correlation, axis-alignment, subspace-distance and Procrustes metrics.

Figure 2: Comparison of different log-ratio transforms (a-f) and encoder depths (g-l) in sparse simulations. Two datasets with 200 and 5000 input dimensions are simulated for benchmarks (a-d, i-l) and (e-h, m-p), respectively. The log-ratio transform benchmarks investigate different log-ratio transforms using a single-layer encoder. The encoder depth benchmark investigates the impact of the number of encoder layers combined with the ILR transform.

Figure 3: Comparison of the decoder inner product between the pretrained stochastic VAE and the pretrained analytical VAE. (a-b) Heatmap of W T W . (c-d) Distribution of diagonal elements and off-diagonal elements of W T W .

Figure S4: Distribution of 52 phenotypes in the Shalapour et al. study, highlighting the class imbalance inherent in this benchmark.

Figure S5: Comparison of Lovell's proportionality metric and the pairwise embedding distances for the stochastic VAE and the analytical VAE evaluated on the Shalapour et al. dataset and the Tripathi et al. dataset. The embedding distance is defined by the Euclidean distance between two rows within ΨW . Since the proportionality metric is not defined for zeros, it is only evaluated for samples where both taxa are observed. Only the top 300 most abundant microbes are visualized here.

Tripathi et al. investigating hypoxia that used mouse fecal pellets (52) and a study conducted by Shalapour et al. investigating cancer effects on mice (53). Due to the connections between Multinomial VAEs and compositional PCA (20; 22), we expect that, if our VAE is performing well, we will see that the row Euclidean distances of ΨW ∈ R d×k correspond to Lovell's un-normalized proportionality metric (54; 23) given by

Classification and Correlation benchmarks. KNN classification was performed with k=5 for all representations. The average negative log-likelihood (NLL) was evaluated on samples that were held out during the training of the VAEs. 100 rounds of 10-fold cross validation were applied to estimate average F1 score, AUC and standard errors. "Raw" indicates that no transformation was applied to the raw counts before performing KNN classification. The agreement between the VAE decoder weights and Lovell's proportionality metric is measured according to Pearson's r on the log-transformed metrics.

Classification on held-out data benchmarks. KNN classifiers were trained on data observed by the VAE and evaluated on 1149 held-out test samples. KNN with k=1 was used across all representations. The predictive log probability is also included to measure the generalizability of the VAEs across held-out samples (smaller values are better). Due to the relationship between squared Euclidean distance and proportionality given in Equation 12, we expect the proportionality metric to exhibit a log-linear relationship with the Euclidean distances obtained from the VAE embeddings. Indeed, this is apparent with the analytical VAE embeddings in Shalapour et al., and to a lesser extent in Tripathi et al., as shown in Figure S5.

ACKNOWLEDGEMENTS

We want to acknowledge Pytorch (61) , Pytorch-Lightning (62), the Biom format (63) , Matplotlib (64) , Scipy (65) , Numpy (66) and Stan (49) for providing the software foundation that this work was built upon.

CODE AVAILABILITY

All of our software and analyses can be found on Zenodo at http://doi.org/10.5281/zenodo.4289004 


Subspace distance

d(U, W) = 1 − Tr( U U^T W_* W_*^T )

where W_* denotes the left singular vectors of W, given by the singular value decomposition W = W_* Λ V^T. Since U U^T yields the ground truth correlations and W_* W_*^T yields the estimated correlations, this metric can be interpreted as a measure of agreement between the ground truth correlations and the estimated correlations.
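A sketch of this metric in NumPy (our own helper name; note that we normalize the trace by k so that identical subspaces score exactly 0, a normalization we assume was dropped in typesetting):

```python
import numpy as np

def subspace_distance(U, W):
    """1 - Tr(U U^T W* W*^T) / k, where W* holds the left singular vectors
    of W. Assumes U has orthonormal columns; 0 when span(W) = span(U)."""
    Wstar = np.linalg.svd(W, full_matrices=False)[0]
    return 1.0 - np.trace(U @ U.T @ Wstar @ Wstar.T) / U.shape[1]
```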

Procrustes Analysis

for A, R ∈ R^{k×k}, where A is a diagonal matrix and R is a rotation matrix. Prior to evaluating this metric, both U and W are standardized such that Tr(U U^T) = Tr(W W^T) = 1, and W is centered around the origin.

