FACTORIZED LINEAR DISCRIMINANT ANALYSIS FOR PHENOTYPE-GUIDED REPRESENTATION LEARNING OF NEURONAL GENE EXPRESSION DATA

Abstract

A central goal in neurobiology is to relate the expression of genes to the structural and functional properties of neuronal types, collectively called their phenotypes. Single-cell RNA sequencing can measure the expression of thousands of genes in thousands of neurons. How can these data be interpreted in the context of neuronal phenotypes? We propose a supervised learning approach that factorizes the gene expression data into components corresponding to individual phenotypic characteristics and their interactions. This new method, which we call factorized linear discriminant analysis (FLDA), seeks a linear transformation of gene expression that varies strongly with only one phenotypic factor and minimally with the others. We further augment our approach with a sparsity-based regularization algorithm, which selects a few genes important to a specific phenotypic feature or feature combination. We applied this approach to a single-cell RNA-Seq dataset of Drosophila T4/T5 neurons, focusing on their dendritic and axonal phenotypes. The analysis confirms results obtained by conventional methods but also points to new genes related to the phenotypes, as well as an intriguing hierarchy in the genetic organization of these cells.

1. INTRODUCTION

The complexity of neural circuits arises from many different types of neurons that connect to each other specifically. Each neuronal type has its own phenotypic traits, which together determine its role in a neural circuit. Typical phenotypic descriptions of neurons include features such as dendritic and axonal laminations, electrophysiological properties, and connectivity (Sanes & Masland, 2015; Zeng & Sanes, 2017; Gouwens et al., 2019). However, the genetic programs behind these phenotypic characteristics are still poorly understood. Recent progress in characterizing neuronal cell types and investigating their gene expression, especially with advances in high-throughput single-cell RNA-Seq (Zeng & Sanes, 2017), provides an opportunity to address this challenge. With the massive data generated from single-cell RNA-Seq, we now face a computational problem: how can we factorize the high-dimensional data into gene expression modules that are meaningful to neuronal phenotypes? Specifically, given phenotypic descriptions of neuronal types, such as their dendritic stratification and axonal termination, can one project the original data into a low-dimensional space corresponding to these phenotypic features and their interactions, and further extract genes critical to each of these components? Here we propose a new analysis method named factorized linear discriminant analysis (FLDA). Inspired by multi-way analysis of variance (ANOVA) (Fisher, 1918), this method factorizes data into components corresponding to phenotypic features and their interactions, and seeks a linear transformation that varies strongly with one specific factor but not with the others. The linear nature of this approach makes it easy to interpret, as the weight coefficients directly inform the relative importance of each gene to each factor. We further introduce a sparse variant of the method, which constrains the number of genes contributing to each linear projection.
We illustrate this approach by applying FLDA to a single-cell transcriptome dataset of T4/T5 neurons in Drosophila (Kurmangaliyev et al., 2019), focusing on two phenotypes: dendritic location and axonal lamination.

Figure 1: Illustration of our approach. (A, B) In the example, cell types are jointly represented by two phenotypic features, indexed with labels i and j respectively. If only some combinations of the two features are observed, one obtains a partial contingency table (B) instead of a complete one (A). (C) We seek linear projections of the data that separate the cell types in a factorized manner corresponding to the two features. Here u, v, and w are aligned with Feature 1, Feature 2, and the interaction of both features, with the projected coordinates y, z, and s respectively.

2. FACTORIZED LINEAR DISCRIMINANT ANALYSIS (FLDA)

Suppose that we are given gene expression data of single neurons, which are typically very high-dimensional. These cells are classified into cell types as a result of clustering in the high-dimensional space and annotations based on prior knowledge or verification outcomes (Macosko et al., 2015; Tasic et al., 2016; Shekhar et al., 2016; Tasic et al., 2018; Peng et al., 2019). We know the phenotypic traits of each neuronal type, so each type can also be jointly defined by its phenotypic features. We want to find an interpretable low-dimensional embedding in which certain dimensions represent factors of phenotypic features or their interactions. Ideally, variation along one axis of the embedding space would reflect variation in only one factor. In reality, this is hard to satisfy because of noise in the data, so we relax the constraint by letting data projected along one axis vary strongly with one factor and minimally with the others.
In addition, we ask that cells classified as the same type remain close to each other in the embedding space, while cells of different types are far apart. As a start, let us consider only two phenotypic features of neurons, dendritic stratification and axonal termination, both of which can be described with discrete categories, such as different regions or layers in the brain (Oh et al., 2014; Euler et al., 2014; Sanes & Masland, 2015; Kurmangaliyev et al., 2019). Suppose that each cell type can be jointly represented by its dendritic location indexed as i and axonal lamination indexed as j, with $n_{ij}$ cells within each cell type. This representation can be described using a contingency table (Figure 1A, B). Note that we allow the table to be partially filled. Let $x_{ijk}$ ($k \in \{1, 2, \ldots, n_{ij}\}$) represent the expression values of g genes in each cell ($x_{ijk} \in \mathbb{R}^g$). How can we find linear projections $y_{ijk} = u^T x_{ijk}$ and $z_{ijk} = v^T x_{ijk}$ that are aligned with features i and j respectively (Figure 1C)? We first asked whether we could factorize, for example, $y_{ijk}$ into components depending on features i and j. Motivated by the linear factor models used in multi-way ANOVA and the idea of partitioning variance, we constructed the following objective function and found the $u^*$ that maximizes it (see detailed analysis in Appendix A):

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T N_A u}{u^T M_e u} \quad (1)$$

When we have a complete table with a levels for feature i and b levels for feature j, we have

$$N_A = M_A - \lambda_1 M_B - \lambda_2 M_{AB} \quad (2)$$

where $M_A$, $M_B$, and $M_{AB}$ are the covariance matrices explained by feature i, feature j, and their interaction; $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the relative weights of $M_B$ and $M_{AB}$ with respect to $M_A$; and $M_e$ is the residual covariance matrix representing noise in gene expression. Formal definitions of these terms are the following:

$$M_A = \frac{1}{a-1} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{i\cdot} - m_{\cdot\cdot})(m_{i\cdot} - m_{\cdot\cdot})^T \quad (3)$$

$$M_B = \frac{1}{b-1} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{\cdot j} - m_{\cdot\cdot})(m_{\cdot j} - m_{\cdot\cdot})^T \quad (4)$$

$$M_{AB} = \frac{1}{(a-1)(b-1)} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{ij} - m_{i\cdot} - m_{\cdot j} + m_{\cdot\cdot})(m_{ij} - m_{i\cdot} - m_{\cdot j} + m_{\cdot\cdot})^T \quad (5)$$

$$M_e = \frac{1}{N-ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \left[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (x_{ijk} - m_{ij})(x_{ijk} - m_{ij})^T \right] \quad (6)$$

where

$$m_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} m_{ij} \quad (7)$$

$$m_{i\cdot} = \frac{1}{b} \sum_{j=1}^{b} m_{ij} \quad (8)$$

$$m_{\cdot j} = \frac{1}{a} \sum_{i=1}^{a} m_{ij} \quad (9)$$

in which

$$m_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} x_{ijk} \quad (10)$$

An analogous expression provides the linear projection $v^*$ for feature j, and $w^*$ for the interaction of features i and j. Similar arguments apply to the scenario of a partial table to find $u^*$ or $v^*$ as the linear projection for feature i or j (see Appendix B for mathematical details). Note that $N_A$ is symmetric and $M_e$ is positive definite, so the optimization problem is a generalized eigenvalue problem (Ghojogh et al., 2019). When $M_e$ is invertible, $u^*$ is the eigenvector associated with the largest eigenvalue of $M_e^{-1} N_A$. In general, if we want to embed $x_{ijk}$ into a d-dimensional subspace aligned with feature i (d < a), we can take the eigenvectors with the d largest eigenvalues of $M_e^{-1} N_A$, which we call the top d factorized linear discriminant components (FLDs). Since multi-way ANOVA can handle contingency tables with more than two dimensions, our analysis generalizes easily to more than two features.
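The construction above can be sketched in a few lines of numpy. This is a minimal illustration for a complete, balanced table, following Equations (1)-(10); the function and variable names are ours, and the generalized eigenproblem is solved here by Cholesky whitening of $M_e$ rather than by any particular routine the paper may have used.

```python
import numpy as np

def flda(X, i_labels, j_labels, lam1=1.0, lam2=1.0, d=1):
    """Sketch of FLDA for a complete contingency table (Eqs. 1-10)."""
    levels_i, levels_j = np.unique(i_labels), np.unique(j_labels)
    a, b, g = len(levels_i), len(levels_j), X.shape[1]
    N = X.shape[0]
    m = np.zeros((a, b, g))                 # per-type means m_ij
    Me = np.zeros((g, g))                   # residual (within-type) covariance
    for p, i in enumerate(levels_i):
        for q, j in enumerate(levels_j):
            cells = X[(i_labels == i) & (j_labels == j)]
            m[p, q] = cells.mean(axis=0)
            r = cells - m[p, q]
            Me += (r.T @ r) / len(cells)    # (1/n_ij) * sum_k r r^T
    Me /= N - a * b
    m_all = m.mean(axis=(0, 1))             # m..
    m_i = m.mean(axis=1)                    # m_i.
    m_j = m.mean(axis=0)                    # m_.j
    MA = (b / (a - 1)) * (m_i - m_all).T @ (m_i - m_all)
    MB = (a / (b - 1)) * (m_j - m_all).T @ (m_j - m_all)
    inter = (m - m_i[:, None, :] - m_j[None, :, :] + m_all).reshape(-1, g)
    MAB = (inter.T @ inter) / ((a - 1) * (b - 1))
    NA = MA - lam1 * MB - lam2 * MAB        # Eq. (2)
    # generalized eigenproblem NA u = lambda * Me u, via whitening Me = L L^T
    L = np.linalg.cholesky(Me)
    Linv = np.linalg.inv(L)
    evals, V = np.linalg.eigh(Linv @ NA @ Linv.T)
    U = Linv.T @ V
    return U[:, ::-1][:, :d]                # top-d FLDs (largest eigenvalues)
```

On toy data in which one gene tracks feature i, the top FLD for i loads predominantly on that gene, as the objective intends.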

3. SPARSITY-BASED REGULARIZATION OF FLDA

For this domain-specific application in neurobiology, there is particular interest in finding a small group of genes that best determine one of the phenotypic features. This amounts to finding axes with only a few non-zero components. To identify such a sparse solution, we solved the following optimization problem:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T N_A u}{u^T M_e u} \quad \text{subject to} \quad \|u\|_0 \leq l \quad (11)$$

so that the number of non-zero elements of $u^*$ is at most l. This is known as a sparse generalized eigenvalue problem, which poses three unique challenges, as listed in Tan et al. (2018): first, when the data are very high-dimensional, $M_e$ can be singular and non-invertible; second, because of the normalization term $u^T M_e u$, many solutions for sparse eigenvalue problems cannot be applied directly; finally, the problem involves maximizing a convex objective over a nonconvex set, which is NP-hard. To solve it, we used truncated Rayleigh flow (Rifle), a method specifically developed for sparse generalized eigenvalue problems. The Rifle algorithm consists of two steps (Tan et al., 2018): first, obtain an initial vector $u_0$ that is close to $u^*$; we used the solution from the non-sparse FLDA as this initial estimate. Second, iteratively perform a gradient ascent step on the objective function, then execute a truncation step that preserves the l entries of u with the largest absolute values and sets the remaining entries to 0. Pseudo-code for this algorithm can be found in Appendix C. As proved in Tan et al. (2018), if there is a unique sparse leading generalized eigenvector, Rifle converges linearly to it with the optimal statistical rate of convergence. The computational complexity of the second step is O(lg + g) per iteration, so Rifle scales linearly with g, the dimensionality of the original data.
Based on the theoretical proof, to guarantee convergence, the hyperparameter $\eta$ was selected to be sufficiently small that $\eta \lambda_{\max}(M_e) < 1$, where $\lambda_{\max}(M_e)$ is the largest eigenvalue of $M_e$. The other hyperparameter l, the number of genes preserved, was selected empirically based on the design of a follow-up experiment. As mentioned later in Results, we chose l = 20, a reasonable number of candidate genes to test in a biological study.
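The gradient-ascent-plus-truncation loop described above can be sketched as follows. This is our reading of the Rifle update in Tan et al. (2018), not a transcription of the paper's Appendix C, so step details may differ.

```python
import numpy as np

def rifle(A, B, u0, k, eta=0.01, n_iter=500):
    """Truncated Rayleigh flow for max u^T A u / u^T B u s.t. ||u||_0 <= k.
    Sketch after Tan et al. (2018); in FLDA, A = N_A and B = M_e."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(n_iter):
        rho = (u @ A @ u) / (u @ B @ u)                  # Rayleigh quotient
        u = u + (eta / rho) * (A @ u - rho * (B @ u))    # gradient ascent step
        small = np.argsort(np.abs(u))[:-k]               # all but k largest |entries|
        u[small] = 0.0                                   # truncation step
        u = u / np.linalg.norm(u)
    return u
```

With B = I and a diagonal A, the iteration recovers the leading coordinate with the requested sparsity.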

4.1. LINEAR DIMENSIONALITY REDUCTION

FLDA is a method for linear dimensionality reduction (Cunningham & Ghahramani, 2015). Formally, linear dimensionality reduction is defined as follows: given n data points, each of g dimensions, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{g \times n}$, and a choice of reduced dimensionality r < g, optimize an objective function f(.) to produce a linear projection $P \in \mathbb{R}^{r \times g}$; then $Y = PX \in \mathbb{R}^{r \times n}$ is the low-dimensional transformed data. State-of-the-art methods for linear dimensionality reduction include principal component analysis (PCA), factor analysis (FA), linear multidimensional scaling (MDS), linear discriminant analysis (LDA), canonical correlation analysis (CCA), maximum autocorrelation factors (MAF), slow feature analysis (SFA), sufficient dimensionality reduction (SDR), locality preserving projections (LPP), and independent component analysis (ICA) (Cunningham & Ghahramani, 2015). Some of these are clearly unsuitable for the problem we want to solve: for example, MAF and SFA were developed for data with temporal structure (Larsen, 2002; Wiskott & Sejnowski, 2002), and LPP focuses on the local structure of the data rather than its global organization (He & Niyogi). The remaining approaches can be roughly grouped into unsupervised and supervised linear dimensionality reduction, and we discuss them separately.

4.2. UNSUPERVISED METHODS FOR LINEAR DIMENSIONALITY REDUCTION

Unsupervised linear dimensionality reduction methods, including PCA (Jolliffe, 2002), ICA (Hyvärinen et al., 2001), FA (Spearman, 1904) and more, project data into a low-dimensional space without using supervision labels. They are not suitable for our problem because of specific characteristics of gene expression data. The dimensionality of gene expression data is usually very high, with tens of thousands of genes, and the expression of a fair number of them can be very noisy. These noisy genes cause large variance among individual cells, but in an unstructured way. Without supervision signals from the phenotypic features, unsupervised methods tend to select these genes to construct the low-dimensional space, which offers neither the desired alignment nor a good separation of cell type clusters. To illustrate this, we performed PCA on the gene expression data and compared it with FLDA. Briefly, we solved the following objective to find the linear projection:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T X X^T u}{u^T u} \quad (12)$$

The outcome of this comparison is shown in Results.
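The failure mode described above is easy to reproduce on toy data: give PCA one label-informative "gene" and one high-variance noise "gene", and the leading principal component follows the noise. A minimal sketch under our own toy effect sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
X = np.zeros((100, 2))
X[:, 0] = labels + rng.normal(0, 0.2, 100)   # "gene" 0 tracks the phenotype label
X[:, 1] = rng.normal(0, 5.0, 100)            # "gene" 1 is high-variance noise
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]
# the leading principal component is dominated by the noisy gene, not the signal
print(np.abs(pc1))
```

A supervised objective such as Equation (1) would instead weight the label-informative gene.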

4.3. SUPERVISED METHODS FOR LINEAR DIMENSIONALITY REDUCTION

Supervised linear dimensionality reduction, represented by LDA (Fisher, 1936; McLachlan, 2004) and CCA (Hotelling, 1936; Wang et al., 2016) , can overcome the above issue. By including supervised signals of phenotypic features, we can devalue genes whose expressions are non-informative about the phenotypes.

4.3.1. LINEAR DISCRIMINANT ANALYSIS (LDA)

We name our method FLDA because its objective function takes a similar form to that of LDA. LDA also models the differences among data organized into pre-determined classes. Formally, LDA solves the following optimization problem:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T \Sigma_b u}{u^T \Sigma_e u} \quad (13)$$

where $\Sigma_b$ and $\Sigma_e$ are estimates of the between-class and within-class covariance matrices respectively. Unlike FLDA, the representation of these classes is not explicitly formulated as a contingency table composed of multiple features. Consequently, when applied to the example problem in which neuronal types are organized into a two-dimensional contingency table with phenotypic features i and j, the axes from LDA are, in general, not aligned with these two phenotypic features. However, in the example above, we can perform two separate LDAs for the two features, which allows the axes from each LDA to align with its specific feature. We call this approach "2LDAs". This approach has two limitations: first, it discards information about the component depending on the interaction of the two features, which cannot be explained by a linear combination of them; second, it explicitly maximizes the segregation of cells with different feature levels, which is not always consistent with a good separation of cell type clusters. Detailed comparisons between LDA, "2LDAs", and FLDA can be found in Results.

4.3.2. CANONICAL CORRELATION ANALYSIS (CCA)

CCA projects two datasets $X_a \in \mathbb{R}^{g \times n}$ and $X_b \in \mathbb{R}^{d \times n}$ to $Y_a \in \mathbb{R}^{r \times n}$ and $Y_b \in \mathbb{R}^{r \times n}$, such that the correlation between $Y_a$ and $Y_b$ is maximized. Formally, it maximizes the objective:

$$(u^*, v^*) = \arg\max_{u \in \mathbb{R}^g, v \in \mathbb{R}^d} \frac{u^T (X_a X_a^T)^{-\frac{1}{2}} X_a X_b^T (X_b X_b^T)^{-\frac{1}{2}} v}{(u^T u)^{\frac{1}{2}} (v^T v)^{\frac{1}{2}}} \quad (14)$$

To apply CCA to our problem, we set $X_a$ to be the gene expression matrix and $X_b$ to be the matrix of d phenotypic features (d = 2 for the two features illustrated later). In contrast with FLDA, CCA finds a transformation of gene expressions aligned with a linear combination of phenotypic features, rather than a factorization of gene expressions corresponding to individual phenotypic features. This difference is quantified in Results.
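For concreteness, the leading canonical pair of Equation (14) can be computed by whitening each dataset and taking an SVD of the cross-covariance. This is a standard construction, not the paper's implementation; the small ridge term `reg` is our addition for numerical stability.

```python
import numpy as np

def cca_first_pair(Xa, Xb, reg=1e-6):
    """Leading canonical pair for Xa (g x n) and Xb (d x n), via Cholesky
    whitening and an SVD; a sketch of Equation (14)."""
    Xa = Xa - Xa.mean(axis=1, keepdims=True)
    Xb = Xb - Xb.mean(axis=1, keepdims=True)
    Ca = Xa @ Xa.T + reg * np.eye(Xa.shape[0])
    Cb = Xb @ Xb.T + reg * np.eye(Xb.shape[0])
    Wa = np.linalg.inv(np.linalg.cholesky(Ca))      # whitens Xa: Wa Ca Wa^T = I
    Wb = np.linalg.inv(np.linalg.cholesky(Cb))
    U, s, Vt = np.linalg.svd(Wa @ Xa @ Xb.T @ Wb.T)
    u = Wa.T @ U[:, 0]                              # map back to the input spaces
    v = Wb.T @ Vt[0]
    return u, v, s[0]                               # s[0]: top canonical correlation
```

On two datasets sharing one latent signal, the projected coordinates are almost perfectly correlated.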

5.1. DATASETS

To quantitatively compare FLDA with other linear dimensionality reduction methods, including PCA, CCA, LDA, and the "2LDAs" approach, we created synthetic datasets. Four types of cells, each containing 25 examples, were generated from a Cartesian product of two features i and j, organized in a 2x2 complete contingency table. Expressions of 10 genes were generated for these cells: the levels of Genes 1-8 were correlated with feature i, feature j, or their interaction, and the levels of the remaining 2 genes were purely driven by noise (Figure 2A). Details of the data generation can be found in Appendix D. To illustrate FLDA on single-cell RNA-Seq datasets for real problems in neurobiology, and to demonstrate the merit of our approach in selecting a few important genes for each phenotype, we used a dataset of Drosophila T4/T5 neurons (Kurmangaliyev et al., 2019). T4 and T5 neurons are very similar in general morphology and physiological properties, but they differ in the location of their dendrites in the medulla and lobula, two distinct brain regions. T4 and T5 neurons each comprise four subtypes, with each pair of the four laminating their axons in a specific layer of the lobula plate (Figure 3A). Therefore, we can use two phenotypic features to describe these neurons: feature i indicates the dendritic location in the medulla or lobula; feature j describes the axonal lamination in one of the four layers (a/b/c/d) (Figure 3B). In this experiment, we focused on the dataset containing expression data of 17492 genes from 3833 cells collected at a defined time during brain development.
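A dataset of the kind described above can be generated in a few lines. This is a sketch in the spirit of the synthetic design; the 3/3/2/2 split of the 10 genes and the effect size of 2.0 are our illustrative assumptions, not the exact recipe of the paper's Appendix D.

```python
import numpy as np

def make_synthetic(noise=0.5, n_per_type=25, seed=0):
    """2x2 synthetic dataset: genes track feature i, feature j, their
    interaction, or pure noise (split and effect sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for i in (0, 1):
        for j in (0, 1):
            means = np.concatenate([
                np.full(3, 2.0 * i),        # genes tracking feature i
                np.full(3, 2.0 * j),        # genes tracking feature j
                np.full(2, 2.0 * i * j),    # interaction-driven genes
                np.zeros(2),                # pure-noise genes
            ])
            blocks.append(means + rng.normal(0.0, noise, (n_per_type, 10)))
    X = np.vstack(blocks)
    i_lab = np.repeat([0, 0, 1, 1], n_per_type)
    j_lab = np.repeat([0, 1, 0, 1], n_per_type)
    return X, i_lab, j_lab
```

The `noise` argument plays the role of the noise levels swept in Figure 2B, C.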

5.2. DATA PREPROCESSING

The T4/T5 neuron dataset was preprocessed as previously reported (Shekhar et al., 2016; Peng et al., 2019; Tran et al., 2019). Briefly, transcript counts within each column of the count matrix (genes×cells) were normalized to sum to the median number of transcripts per cell, yielding the normalized counts Transcripts-per-median, or $TPM_{gc}$, for Gene g in Cell c. We used the log-transformed expression data $E_{gc} = \ln(TPM_{gc} + 1)$ for further analysis. To select highly variable genes, we adopted a common approach in single-cell RNA-Seq studies based on fitting a relationship between the mean and the coefficient of variation (Chen et al., 2016; Pandey et al., 2018), and performed FLDA on the expression data restricted to these genes. We preprocessed the data with PCA and kept principal components (PCs) explaining ∼99% of the total variance before running FLDA, but not before running the sparse version of the algorithm. In the experiments below, we set the hyper-parameters $\lambda_1$ and $\lambda_2$ in Equation (2) to 1.
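The normalization and log transform described above amount to a two-line computation. A minimal sketch (gene filtering and the variance-based gene selection are omitted; the function name is ours):

```python
import numpy as np

def preprocess(counts):
    """Median-normalize and log-transform a genes x cells count matrix:
    scale each cell to the median library size, then take ln(TPM + 1)."""
    per_cell = counts.sum(axis=0)              # total transcripts per cell
    median = np.median(per_cell)
    tpm = counts / per_cell * median           # Transcripts-per-median
    return np.log(tpm + 1.0)                   # E_gc = ln(TPM_gc + 1)
```

Each column (cell) is scaled so its counts sum to the median library size before the log transform.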

5.3. METRICS

We used the following metrics to evaluate our method. A signal-to-noise ratio (SNR) measures how well each discriminant axis separates cell types relative to the noise estimated from the variance within cell type clusters. The explained variance (EV) for each discriminant axis measures how much of the total variance explained by that axis is variance of feature i or j. The mutual information (MI) between each discriminant axis and each feature quantifies how "informative" an axis is about a specific feature. Building on the calculation of MI, we included the modularity score, which measures whether each discriminant axis depends on at most one feature (Ridgeway & Mozer, 2018). Implementation details of these metrics can be found in Appendix E.
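The SNR metric described above can be sketched as a ratio of between-type to within-type variance along one axis. This is our simplified reading; the paper's exact definition is in its Appendix E.

```python
import numpy as np

def snr(y, type_labels):
    """Signal-to-noise ratio along one discriminant axis: variance of the
    cell-type means over the mean within-type variance (a sketch)."""
    types = np.unique(type_labels)
    means = np.array([y[type_labels == t].mean() for t in types])
    within = np.mean([y[type_labels == t].var() for t in types])
    return means.var() / within
```

An axis that cleanly separates two cell types scores far higher than one on which the types overlap.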

6. RESULTS

To quantitatively compare FLDA with the alternative models, including PCA, CCA, LDA, and "2LDAs", we measured the proposed metrics on the synthesized datasets (Figure 2A). Given that the synthesized data were organized in a 2x2 contingency table, each LDA of the "2LDAs" approach could find only one dimension for the specific feature i or j. Therefore, as a fair comparison, we only included the corresponding dimensions of FLDA (FLD_i and FLD_j) and the top two components of PCA, CCA, and LDA. The overall SNR values, normalized by that of LDA, and the overall modularity scores were plotted for data generated at different noise levels (Figure 2B, C). The performance of PCA is the worst among all these models, because the unsupervised approach cannot prevent the noise from contaminating the signal. The supervised approaches in general have good SNRs, but LDA and CCA suffer from low modularity scores. This is expected, because LDA maximizes the separation of cell type clusters but overlooks the alignment of the axes to feature i or j, and CCA maximizes the correlation to a linear combination of phenotypic features rather than to individual ones. By contrast, "2LDAs" achieves the highest modularity scores but has the worst SNR among the supervised approaches, because it tries to maximize the separation of cells with different feature levels, which is not necessarily consistent with maximizing the segregation of cell types. Both the SNR value and the modularity score of FLDA are close to optimal, as it not only considers the alignment of axes to different features but also constrains the variance within cell types. A representative plot of the EV and MI metrics of these models is shown in Figure 5, reporting good alignment of axes to either feature i or j in FLDA and "2LDAs", but not in the others.
A question of significance in neurobiology is whether the diverse phenotypes of neuronal cell types are generated by combinations of modular transcriptional programs, and if so, what the gene signature of each program is. To illustrate the ability of our approach to address this problem, we applied FLDA to the dataset of Drosophila T4/T5 neurons. The T4/T5 neurons can be organized in a 2x4 contingency table; therefore, FLDA projects the expression data into a subspace of seven dimensions, with one FLD aligned with dendritic location i (FLD_i), three FLDs aligned with axonal termination j (FLD_j1-3), and the remaining three representing the interaction of both phenotypes (FLD_ij1-3). We ranked these axes by their SNR metrics and found that FLD_j1, FLD_i, and FLD_j2 have much higher SNRs than the rest (Figure 3C). Indeed, data representations in the subspace consisting of these three dimensions show a clear separation of the eight neuronal types (Figure 3D). As expected, FLD_i teases apart T4 from T5 neurons, whose dendrites are located in different brain regions (Figure 3E). Interestingly, FLD_j1 separates T4/T5 neurons into two groups, a/b vs. c/d, corresponding to the upper or lower lobula plate, and FLD_j2 divides them into another two, a/d vs. b/c, indicating whether their axons laminate in the middle or lateral part of the lobula plate (Figure 3E, F). Unexpectedly, among these three dimensions, FLD_j1 has a much higher SNR than FLD_i and FLD_j2, whose SNR values are similar. This suggests a hierarchical structure in the genetic organization of T4/T5 neurons: they are first separated into either a/b or c/d types, and subsequently divided into each of the eight subtypes. In fact, this exactly matches the sequence of their cell fate determination during development, as revealed in a previous genetic study (Pinto-Teixeira et al., 2018).
Finally, the last discriminant axis of the axonal feature, FLD_j3, separates the group a/c from b/d, suggesting a role in fine-tuning the axonal depth within the upper or lower lobula plate (Figure 3G). To seek gene signatures for the discriminant components in FLDA, we applied the sparsity-based regularization to constrain the number of genes with non-zero weight coefficients. Here we set the number to 20, a reasonable number of candidate genes that might be tested in a follow-up biological study. We extracted a list of 20 genes each for the axes FLD_i and FLD_j1. The relative importance of these genes to each axis is directly given by their weight values (Figure 4A, C). Side by side, we plotted the expression profiles of these genes in the eight neuronal types (Figure 4B, D). For both axes, the genes critical for separating cells with different feature levels are differentially expressed in the corresponding cell types. We compared our gene lists with those obtained using conventional methods, as reported in Kurmangaliyev et al. (2019). Consistent with that report, we found indicator genes for dendritic location such as TfAP-2, dpr2, CG34155, and CG12065, and for axonal lamination such as klg, bi, and pros. In addition, we found genes that were not reported in this previous study. For example, our results suggest that the genes Thor and pHCl-1 are important to the dendritic phenotype, and Lac and Mip are critical to the axonal phenotype. These are promising genetic targets to be tested in biological experiments. Lastly, FLDA allowed us to examine the component that depends on the interaction of both features and to identify its gene signature, which provides clues to the transcriptional regulation of gene expression in the T4/T5 neuronal types (Figures 6 and 7). As a supervised approach, FLDA depends on correct phenotype labels to extract meaningful information. But if the phenotypes are annotated incorrectly, can we use FLDA to raise a flag?
We propose a perturbation analysis of FLDA to address this question, built on the assumption that, among possible phenotype annotations, projecting the gene expression data with correct labels leads to better metric measurements than with incorrect ones. As detailed in Appendix F, we generated three kinds of incorrect labels for the dataset of T4/T5 neurons, corresponding to three common scenarios of mislabeling: the phenotypes of one cell type were mislabeled with those of another; a single phenotypic level was incorrectly split into two; or two phenotypic levels were incorrectly merged into one. FLDA was applied to gene expressions of T4/T5 neurons with these perturbed annotations. The proposed metrics, the SNR value and the modularity score, are plotted in Figure 8. Indeed, the projection of gene expressions with the correct annotation leads to the best SNR value and modularity score compared with incorrect annotations. This suggests that such perturbation analysis is a useful practice in general: it raises confidence that the original annotation is correct if FLDA on perturbed annotations produces lower metric scores.

7. DISCUSSION

We developed FLDA, a novel supervised linear dimensionality reduction method for understanding the relationship between high-dimensional gene expression patterns and cellular phenotypes. We illustrated the power of FLDA by analyzing a gene expression dataset from Drosophila T4/T5 neurons, which are labeled by two phenotypic features, each with multiple levels. The approach allowed us to identify new genes for each of the phenotypic features that were not apparent with conventional methods. Furthermore, we found a hierarchical relationship in the genetic organization of these cells. These findings point the way for new biological experiments. The approach is motivated by multi-way ANOVA, and thus generalizes easily to more than two features. Future applications in neurobiology include the analysis of phenotypic characteristics such as electrophysiology and connectivity (Zeng & Sanes, 2017; Gouwens et al., 2019; 2020). More generally, FLDA can be applied to any labeled dataset for which the labels form a Cartesian product of multiple features. For example, this includes face images jointly labeled by the age, gender, and other attributes of a person (Moghaddam & Ming-Hsuan Yang, 2002; Zhang et al., 2017). FLDA factorizes gene expression data into features and their interactions, and finds a linear projection of the data that varies with only one factor and not the others. This provides a modular representation aligned with the factors (Bengio et al., 2012). Ridgeway & Mozer (2018) argued that modularity together with explicitness could define disentangled representations. Because our approach is linear, it presents an explicit mapping between gene expressions and phenotypic features, and can therefore potentially serve as a supervised approach to disentanglement (Kingma et al., 2014; Kulkarni et al., 2015; Karaletsos et al., 2016).
Compared with other non-linear embedding methods for cell types (Hinton & Salakhutdinov, 2006; Feng et al., 2014; Gala et al., 2019) , the linear nature of FLDA makes it extremely easy to interpret the low-dimensional representations, as the weight vector directly informs the relative importance of each gene. To allow the selection of a small set of critical genes, we leveraged our approach with sparse regularization. This makes FLDA especially useful to experimentalists who can take the list of genes and test them in a subsequent round of experiments.

8.1. A. OBJECTIVE FUNCTIONS

Here we derive the objective functions used in our analysis. Again, if $x_{ijk}$ ($k \in \{1, 2, \ldots, n_{ij}\}$) represents the expression values of g genes in each cell ($x_{ijk} \in \mathbb{R}^g$), we seek a linear projection $y_{ijk} = u^T x_{ijk}$ that is aligned with feature i.

8.1.1. INSPIRATION FROM ANOVA

We asked what is the best way to factorize $y_{ijk}$. Inspired by multi-way ANOVA (Fisher, 1918), we identified three components: one depending on feature i, another depending on feature j, and the last depending on the interaction of both features. We therefore followed the procedures of ANOVA to partition sums of squares and factorize $y_{ijk}$ into these three components. Let us first assume that all cell types defined by i and j contain the same number of cells. With cell types represented by a complete contingency table (Figure 1A), $y_{ijk}$ can be linearly factorized using the model of two crossed factors:

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \quad (15)$$

where $y_{ijk}$ represents the coordinate of the kth cell in the category defined by i and j; $\mu$ is the average level of y; $\alpha_i$ is the component that depends on feature i; $\beta_j$ is the component that depends on feature j; $(\alpha\beta)_{ij}$ describes the component that depends on the interaction of features i and j; and $\epsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)$ is the residual of this factorization. Let us say that features i and j fall into a and b discrete categories respectively. Then, without loss of generality, we can require:

$$\sum_{i=1}^{a} \alpha_i = 0 \quad (16)$$

$$\sum_{j=1}^{b} \beta_j = 0 \quad (17)$$

$$\sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0 \quad (18)$$

Corresponding to these, there are three null hypotheses:

$$H_{01}: \alpha_i = 0 \quad (19)$$

$$H_{02}: \beta_j = 0 \quad (20)$$

$$H_{03}: (\alpha\beta)_{ij} = 0 \quad (21)$$

Here we want to reject $H_{01}$ while accepting $H_{02}$ and $H_{03}$ so that $y_{ijk}$ is aligned with feature i. Next, we partition the total sum of squares. If the number of cells within each cell type category is n, and the total number of cells is N, then we have

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{\cdot\cdot\cdot})^2 = bn \sum_{i=1}^{a} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 + an \sum_{j=1}^{b} (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 + n \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2 \quad (22)$$

where $\bar{y}$ is the average of $y_{ijk}$ over the indices indicated by the dots. Equation (22) can be written as

$$SS_T = SS_A + SS_B + SS_{AB} + SS_e \quad (23)$$

with the terms having degrees of freedom N-1, a-1, b-1, (a-1)(b-1), and N-ab respectively.
Here $SS_A$, $SS_B$, $SS_{AB}$, and $SS_e$ are the partitioned sums of squares for the factors $\alpha_i$, $\beta_j$, $(\alpha\beta)_{ij}$, and the residual. ANOVA rejects or accepts a null hypothesis by comparing its mean square (the partitioned sum of squares normalized by the degrees of freedom) to that of the residual. This is done by constructing an F-statistic for each factor, as shown below:

$$F_A = \frac{MS_A}{MS_e} = \frac{SS_A/(a-1)}{SS_e/(N-ab)} \quad (24)$$

$$F_B = \frac{MS_B}{MS_e} = \frac{SS_B/(b-1)}{SS_e/(N-ab)} \quad (25)$$

$$F_{AB} = \frac{MS_{AB}}{MS_e} = \frac{SS_{AB}/[(a-1)(b-1)]}{SS_e/(N-ab)} \quad (26)$$

Under the null hypotheses, the F-statistics follow the F-distribution. Therefore, a null hypothesis is rejected when we observe a value of the F-statistic above a certain threshold calculated from the F-distribution. Here we want $F_A$ to be large enough that we can reject $H_{01}$, but $F_B$ and $F_{AB}$ small enough that we accept $H_{02}$ and $H_{03}$. In other words, we want to maximize $F_A$ while minimizing $F_B$ and $F_{AB}$. Therefore, we propose maximizing the objective

$$L = F_A - \lambda_1 F_B - \lambda_2 F_{AB} \quad (27)$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters determining the relative weights of $F_B$ and $F_{AB}$ compared with $F_A$.
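The partition of the total sum of squares into $SS_A + SS_B + SS_{AB} + SS_e$ holds exactly for a balanced design, which is easy to verify numerically. A minimal check (dimensions and seed are arbitrary):

```python
import numpy as np

# Numerical check of the balanced two-way partition SS_T = SS_A + SS_B + SS_AB + SS_e
rng = np.random.default_rng(3)
a, b, n = 2, 3, 10
y = rng.normal(size=(a, b, n))             # y_ijk, n cells per (i, j) cell type
gm = y.mean()                              # grand mean
mi = y.mean(axis=(1, 2))                   # level means over feature i
mj = y.mean(axis=(0, 2))                   # level means over feature j
mij = y.mean(axis=2)                       # cell-type means
SS_T = ((y - gm) ** 2).sum()
SS_A = b * n * ((mi - gm) ** 2).sum()
SS_B = a * n * ((mj - gm) ** 2).sum()
SS_AB = n * ((mij - mi[:, None] - mj[None, :] + gm) ** 2).sum()
SS_e = ((y - mij[:, :, None]) ** 2).sum()
print(SS_T, SS_A + SS_B + SS_AB + SS_e)    # the two sums agree
```

With unequal cell counts the identity breaks down, which is exactly what motivates the modified partition in the next subsection.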

8.1.2. OBJECTIVE FUNCTIONS UNDER A COMPLETE CONTINGENCY TABLE

When the numbers of cells within categories defined by $i$ and $j$ ($n_{ij}$) are not all the same, the total sum of squares cannot be partitioned as in Equation (22). However, if we only care about distinctions between cell types instead of individual cells, we can use the mean value of each cell type cluster ($\bar{y}_{ij.}$) to estimate the overall average value ($\tilde{y}_{...}$) and the average value of each category $i$ ($\tilde{y}_{i..}$) or $j$ ($\tilde{y}_{.j.}$). Therefore, Equation (22) can be modified as the following:

$$\sum_{i=1}^{a} \sum_{j=1}^{b} \Big[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (y_{ijk} - \tilde{y}_{...})^2 \Big] = b \sum_{i=1}^{a} (\tilde{y}_{i..} - \tilde{y}_{...})^2 + a \sum_{j=1}^{b} (\tilde{y}_{.j.} - \tilde{y}_{...})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{y}_{ij.} - \tilde{y}_{i..} - \tilde{y}_{.j.} + \tilde{y}_{...})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} \Big[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2 \Big] \quad (28)$$

If we describe Equation (28) as

$$SS_T = SS_A + SS_B + SS_{AB} + SS_e \quad (33)$$

then, following the same arguments, we want to maximize an objective function of the following form:

$$L = \frac{SS_A/(a-1) - \lambda_1 \, SS_B/(b-1) - \lambda_2 \, SS_{AB}/\big((a-1)(b-1)\big)}{SS_e/(N-ab)} \quad (34)$$

8.1.3. OBJECTIVE FUNCTIONS UNDER A PARTIAL CONTINGENCY TABLE

When we have a representation of a partial table, we can no longer separate out the component that depends on the interaction of both features. Therefore, we use another model, a linear model of two nested factors, to factorize $y_{ijk}$:

$$y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}$$

Note that we now have $\beta_{j(i)}$ instead of $\beta_j + (\alpha\beta)_{ij}$. In this model, we identify a primary factor, for instance the feature denoted by $i$, which falls into $a$ categories; the other (indexed by $j$) becomes a secondary factor, the number of whose levels $b_i$ depends on the level of the primary factor. We merge the component depending on the interaction of both features into that of the secondary factor as $\beta_{j(i)}$. Similarly, we have

$$\sum_{i=1}^{a} \sum_{j=1}^{b_i} \Big[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (y_{ijk} - \tilde{y}_{...})^2 \Big] = \sum_{i=1}^{a} \Big[ \sum_{j=1}^{b_i} (\tilde{y}_{i..} - \tilde{y}_{...})^2 \Big] + \sum_{i=1}^{a} \sum_{j=1}^{b_i} (\bar{y}_{ij.} - \tilde{y}_{i..})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b_i} \Big[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2 \Big]$$

which can be written as

$$SS_T = SS_A + SS_B + SS_e \quad (37)$$

with degrees of freedom $N-1$, $a-1$, $M-a$, and $N-M$ for each of the terms, where

$$M = \sum_{i=1}^{a} b_i \quad (38)$$

Therefore, we want to maximize the following objective:

$$L = \frac{SS_A/(a-1) - \lambda \, SS_B/(M-a)}{SS_e/(N-M)} \quad (39)$$

8.2. B. FLDA WITH A PARTIAL CONTINGENCY TABLE

Here we provide the mathematical details of FLDA under the representation of a partial table. When we have a partial table, if the feature $i$ is the primary feature with $a$ levels, and the feature $j$ is the secondary feature with $b_i$ levels, then $N_A$ in Equation (1) is defined as follows:

$$N_A = M_A - \lambda M_{B|A}$$

where

$$M_A = \frac{1}{a-1} \sum_{i=1}^{a} \sum_{j=1}^{b_i} (m_{i.} - m_{..})(m_{i.} - m_{..})^T$$
$$M_{B|A} = \frac{1}{M-a} \sum_{i=1}^{a} \sum_{j=1}^{b_i} (m_{ij} - m_{i.})(m_{ij} - m_{i.})^T$$

and $M$ is defined as in Equation (38). Correspondingly, $M_e$ in Equation (1) is defined as:

$$M_e = \frac{1}{N-M} \sum_{i=1}^{a} \sum_{j=1}^{b_i} \Big[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (x_{ijk} - m_{ij})(x_{ijk} - m_{ij})^T \Big]$$

with

$$m_{..} = \frac{1}{M} \sum_{i=1}^{a} \sum_{j=1}^{b_i} m_{ij}, \qquad m_{i.} = \frac{1}{b_i} \sum_{j=1}^{b_i} m_{ij}$$

The remaining mathematical arguments are the same as those for the complete table.
In this scenario, because we do not observe all possible combinations of features $i$ and $j$, we cannot find the linear projection for the interaction of both features.
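To make the nested-table definitions concrete, here is an illustrative sketch (our own code and naming, not the authors' implementation) that assembles $N_A = M_A - \lambda M_{B|A}$ and $M_e$ from a matrix of cells with primary and secondary labels:

```python
import numpy as np

def nested_scatter_matrices(X, i_lab, j_lab, lam=1.0):
    """Build N_A = M_A - lam * M_{B|A} and M_e for a partial (nested) table.

    X: (cells x genes) matrix; i_lab: primary feature labels; j_lab: secondary
    feature labels, meaningful only within a level of the primary feature.
    Illustrative sketch of the definitions in the text, not reference code.
    """
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    labels = list(zip(i_lab, j_lab))
    pairs = sorted(set(labels))                       # observed (i, j) cell types
    masks = {p: np.array([lab == p for lab in labels]) for p in pairs}
    m = {p: X[masks[p]].mean(axis=0) for p in pairs}  # cluster means m_ij
    a_levels = sorted(set(i_lab))
    a, M, N = len(a_levels), len(pairs), len(X)       # M = sum_i b_i
    m_all = np.mean([m[p] for p in pairs], axis=0)    # m_..
    m_i = {u: np.mean([m[p] for p in pairs if p[0] == u], axis=0)
           for u in a_levels}                         # m_i.
    M_A = np.zeros((d, d)); M_BA = np.zeros((d, d)); M_e = np.zeros((d, d))
    for p in pairs:
        da = (m_i[p[0]] - m_all)[:, None]             # primary-feature scatter
        M_A += da @ da.T / (a - 1)
        db = (m[p] - m_i[p[0]])[:, None]              # nested secondary scatter
        M_BA += db @ db.T / (M - a)
        dev = X[masks[p]] - m[p]                      # within-cluster residuals
        M_e += (dev.T @ dev) / masks[p].sum() / (N - M)
    return M_A - lam * M_BA, M_e
```

The returned pair $(N_A, M_e)$ is what the rifle routine of Appendix C takes as input.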

8.3. C. PSEUDO-CODE FOR RIFLE

We show pseudo-code for the rifle algorithm in the annex.

8.4. D. IMPLEMENTATION DETAILS OF DATA SYNTHESIS

Synthetic gene expression values were generated as linear functions of the two features and their interaction

$$i \vee j = \begin{cases} 0, & \text{if } i = 0, j = 0 \\ 1, & \text{otherwise} \end{cases}$$

(see the annex for the generative model). The residual was driven by Gaussian noise, namely

$$\epsilon_{ijk} \sim N(0, \sigma^2) \quad (58)$$

We synthesized datasets at 5 different $\sigma$ values ($\sigma \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$). This was repeated 10 times, and metrics for each $\sigma$ value were calculated as the average across the 10 repeats.
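The synthesis procedure can be sketched as follows. Only a few of the ten gene definitions survive in the text (e.g. $g^2 = j + \epsilon$, $g^5 = 2i + \epsilon$, $g^9 = \epsilon$), so the full coefficient table below is a hypothetical placeholder consistent with that form; the code and its names are ours, not the authors':

```python
import numpy as np

def synthesize(n_per_type=25, sigma=0.6, seed=0):
    """Synthesize a toy dataset in the spirit of Appendix D.

    Four cell types from the Cartesian product i, j in {0, 1}; each gene is a
    linear function of i, j, and the interaction (i OR j), plus Gaussian noise.
    Only genes 2, 5, and 9 follow definitions given in the text; the other
    coefficient rows are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    # One row per gene: (coef_i, coef_j, coef_interaction)
    coefs = np.array([
        [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [2, 0, 0],
        [0, 2, 0], [0, 0, 2], [1, 0, 1], [0, 0, 0], [0, 1, 1],
    ], dtype=float)
    X, i_lab, j_lab = [], [], []
    for i in (0, 1):
        for j in (0, 1):
            signal = coefs @ np.array([i, j, float(i or j)])  # per-gene means
            noise = rng.normal(0.0, sigma, size=(n_per_type, len(coefs)))
            X.append(signal + noise)
            i_lab += [i] * n_per_type
            j_lab += [j] * n_per_type
    return np.vstack(X), np.array(i_lab), np.array(j_lab)
```

Sweeping `sigma` over {0.2, 0.4, 0.6, 0.8, 1.0} with 10 seeds reproduces the kind of noise sweep described above.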

8.5. E. IMPLEMENTATION DETAILS OF THE METRICS USED IN THE STUDY

We measured the following metrics in our experiments:

8.5.1. SIGNAL-TO-NOISE RATIO (SNR)

Because we care about the separation of cell types, we define the SNR metric as the ratio of the variance between cell types over the variance of the noise, which is estimated from the within-cluster variance. For the entire embedding space, given $q$ cell types, if the coordinate of each cell is indicated by $c$, then we define the overall SNR metric as the following:

$$SNR_{overall} = \frac{\mathrm{tr}\Big( \sum_{p=1}^{q} n_p (\bar{c}_{p.} - \bar{c}_{..})(\bar{c}_{p.} - \bar{c}_{..})^T \Big)}{\mathrm{tr}\Big( \sum_{p=1}^{q} \sum_{k=1}^{n_p} (c_{pk} - \bar{c}_{p.})(c_{pk} - \bar{c}_{p.})^T \Big)}$$

where $\bar{c}_{p.}$ is the center of each cell type cluster, and $\bar{c}_{..}$ is the center of all data points. Let $c$ denote the embedded coordinate along a specific dimension. The SNR metric for that axis is therefore:

$$SNR = \frac{\sum_{p=1}^{q} n_p (\bar{c}_{p.} - \bar{c}_{..})^2}{\sum_{p=1}^{q} \sum_{k=1}^{n_p} (c_{pk} - \bar{c}_{p.})^2}$$
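These definitions translate directly into code. The sketch below is our own illustrative implementation (names are ours) of the per-axis and overall SNR:

```python
import numpy as np

def snr_axis(c, types):
    """SNR of a 1-D embedding c: between-type variance over within-type
    variance, following the per-axis definition above (illustrative sketch)."""
    c, types = np.asarray(c, float), np.asarray(types)
    grand = c.mean()
    num = sum(c[types == t].size * (c[types == t].mean() - grand) ** 2
              for t in np.unique(types))
    den = sum(((c[types == t] - c[types == t].mean()) ** 2).sum()
              for t in np.unique(types))
    return num / den

def snr_overall(C, types):
    """Overall SNR for a multi-dimensional embedding C (cells x dims), using
    traces of the between- and within-type scatter matrices."""
    C, types = np.asarray(C, float), np.asarray(types)
    grand = C.mean(axis=0)
    between = sum(C[types == t].shape[0] *
                  np.outer(C[types == t].mean(0) - grand,
                           C[types == t].mean(0) - grand)
                  for t in np.unique(types))
    within = sum((C[types == t] - C[types == t].mean(0)).T @
                 (C[types == t] - C[types == t].mean(0))
                 for t in np.unique(types))
    return np.trace(between) / np.trace(within)
```

For a one-dimensional embedding the two functions agree, as the trace of a 1x1 scatter matrix is just the scalar sum of squares.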

8.5.2. EXPLAINED VARIANCE (EV)

We want to know whether the variation along a specific dimension is strongly explained by that of a specific feature. Therefore, we measure, for each axis, how much of the total variance is explained by the variance of the feature $i$ or $j$. Formally, given the embedded coordinate $y_{ijk}$, we calculate the EV as the following:

$$EV_i = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} (\bar{y}_{i..} - \bar{y}_{...})^2}{\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{...})^2} \quad (61)$$

$$EV_j = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} (\bar{y}_{.j.} - \bar{y}_{...})^2}{\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{...})^2} \quad (62)$$

where $\bar{y}$ is the average of $y_{ijk}$ over the indices indicated by the dots.
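For instance, EV can be computed per axis by grouping cells on one feature at a time. This is an illustrative sketch with our own naming, not the study's code:

```python
import numpy as np

def explained_variance(y, labels):
    """Fraction of the total variance of a 1-D embedding y explained by one
    feature's labels (EV_i or EV_j above). Illustrative sketch."""
    y, labels = np.asarray(y, float), np.asarray(labels)
    grand = y.mean()
    # Between-level sum of squares, weighted by the number of cells per level
    num = sum((labels == v).sum() * (y[labels == v].mean() - grand) ** 2
              for v in np.unique(labels))
    den = ((y - grand) ** 2).sum()      # total sum of squares
    return num / den
```

An axis perfectly aligned with a feature has EV of 1 for that feature; an axis orthogonal to it (equal group means) has EV of 0.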

8.5.3. MUTUAL INFORMATION (MI)

The MI between a discriminant axis $u$ and a feature quantifies how much information about the feature is obtained by observing data projected along that axis. It is calculated as the MI between the data representations along the axis, $y = u^T X$, and the feature labels of the data, $f$, where $X$ is the original gene expression matrix:

$$I(y, f) = H(y) + H(f) - H(y, f) = -\sum_{y \in Y} p(y) \log_2 p(y) - \sum_{f \in F} p(f) \log_2 p(f) + \sum_{y \in Y} \sum_{f \in F} p(y, f) \log_2 p(y, f)$$

Here $H$ indicates entropy. To calculate $H(y)$ and $H(y, f)$, we discretize $y$ into 10 bins.

8.5.4. MODULARITY

Ridgeway & Mozer (2018) argued that in a modular representation, each axis should depend on at most a single feature. Following the arguments in their paper, the modularity score is computed as follows: we first calculate the MI between each feature and each axis ($m_{if}$ denotes the MI between one axis $i$ and one feature $f$). If an axis is perfectly modular, it will have high mutual information for only one feature and zeros for the others. We therefore compute a template $t_{if}$ as the following:

$$t_{if} = \begin{cases} \theta_i, & \text{if } f = \arg\max_g m_{ig} \\ 0, & \text{otherwise} \end{cases}$$

where $\theta_i = \max_g m_{ig}$. We then calculate the deviation from the template as:

$$\delta_i = \frac{\sum_f (m_{if} - t_{if})^2}{\theta_i^2 (N - 1)}$$

where $N$ is the number of features. The modularity score for the axis $i$ is $1 - \delta_i$, and the mean of $1 - \delta_i$ over $i$ is defined as the overall modularity score.
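A minimal sketch of both computations (our own code, not the study's; the 10-bin discretization follows the text, and the template's tie-breaking uses argmax):

```python
import numpy as np

def mutual_info(y, f, n_bins=10):
    """MI (bits) between a 1-D projection y, discretized into n_bins, and
    discrete feature labels f. Illustrative sketch of the definition above."""
    y, f = np.asarray(y, float), np.asarray(f)
    # Discretize y into n_bins equal-width bins
    y_bin = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins)[1:-1])
    levels = np.unique(f)
    joint = np.zeros((n_bins, len(levels)))
    for k, lab in enumerate(levels):
        for b in y_bin[f == lab]:
            joint[b, k] += 1
    p = joint / joint.sum()
    py, pf = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (py @ pf)[nz])).sum())

def modularity(mi_matrix):
    """Modularity score of Ridgeway & Mozer (2018) from a matrix of MI values
    m_if (axes x features): 1 minus the deviation from a one-hot template."""
    m = np.asarray(mi_matrix, float)
    n_feat = m.shape[1]
    scores = []
    for row in m:
        theta = row.max()
        t = np.zeros_like(row)
        t[int(row.argmax())] = theta          # one-hot template per axis
        delta = ((row - t) ** 2).sum() / (theta ** 2 * (n_feat - 1))
        scores.append(1.0 - delta)
    return float(np.mean(scores))
```

A perfectly modular MI matrix (one dominant feature per axis) scores 1; an axis equally informative about all features scores 0.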

8.6. F. IMPLEMENTATION DETAILS OF ANNOTATION PERTURBATION

To evaluate the effect of mislabeling phenotypic levels, we made use of the dataset of T4/T5 neurons and applied three kinds of perturbations to the original labels. First, we switched the phenotype labels of T4a neurons with those of one of the seven other types (T4b, T4c, T4d, T5a, T5b, T5c, T5d). In this scenario, the phenotype labels of two cell types were incorrect, but the number of cell type clusters stayed the same. We had two levels of the dendritic phenotype (T4/T5) and four levels of the axonal phenotype (a/b/c/d); therefore we kept one dimension for the dendritic feature and three dimensions for the axonal feature. Second, we merged the axonal phenotypic level a with another level (b/c/d) into an incorrect new level (a+b/a+c/a+d). In this scenario, we had three axonal phenotypes, so we kept two dimensions for the axonal feature. Third, we randomly split each of the four axonal lamination labels (a/b/c/d) into two levels. For instance, among neurons with the original axonal level a, some were labeled with a level a1 and the others with a level a2. In this scenario, we had eight axonal phenotypes (a1/a2/b1/b2/c1/c2/d1/d2), and we kept seven dimensions for the axonal feature. We performed FLDA on the dataset of T4/T5 neurons with these perturbed annotations. Metrics from each of the perturbed annotations were measured and compared with those from the original annotation.
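The merge and split perturbations can be generated, for example, as follows (a sketch with our own helper names; the actual label handling in the study may differ):

```python
import numpy as np

def merge_levels(labels, a, b):
    """Perturbation 2: merge level `a` with level `b` into one (incorrect)
    combined level, e.g. 'a' and 'b' both become 'a+b'."""
    return np.array([f"{a}+{b}" if l in (a, b) else l for l in labels])

def split_levels(labels, seed=0):
    """Perturbation 3: randomly split every level into two sub-levels,
    e.g. 'a' becomes 'a1' or 'a2' at random per cell."""
    rng = np.random.default_rng(seed)
    return np.array([f"{l}{rng.integers(1, 3)}" for l in labels])
```

Applying `merge_levels` to the axonal labels reduces four levels to three; applying `split_levels` doubles them to eight, matching the scenarios described above.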



Figure 2: Quantitative comparison between FLDA and other models. (A) Illustration of data synthesis. See Appendix D for implementation details. Color bar indicates the expression values of the ten generated genes. (B) Normalized overall SNR metric of each analysis. The SNR values are normalized with respect to that of LDA. (C) Overall modularity score for each analysis. Error bars in (B,C) denote standard errors each calculated from 10 repeated simulations.

Figure 3: FLDA on the dataset of T4/T5 neurons. (A) T4/T5 neuronal types and their dendritic and axonal phenotypes. (B) T4/T5 neurons can be organized in a complete contingency table. Here i indicates the dendritic location and j indicates the axonal termination. (C) SNR metric of each discriminant axis. (D) Projection of the data into the three-dimensional space consisting of the discriminant axis for the feature i (FLD i ) and the first and second discriminant axes for the feature j (FLD j 1 and FLD j 2 ). (E-G) Projection of the data into the two-dimensional space made of FLD i and FLD j 1 (E), FLD j 1 and FLD j 2 (F), or FLD j 2 and FLD j 3 (the third discriminant axis for the feature j) (G). Different cell types are indicated by different colors as in (A) and (D).


Figure 4: Critical genes extracted from the sparse algorithm. (A) Weight vector of the 20 genes selected for the dendritic phenotype (FLD i ). The weight value is indicated in the color bar with color indicating direction (red: positive and green: negative) and saturation indicating magnitude. (B) Expression patterns of the 20 genes from (A) in eight types of T4/T5 neurons. Dot size indicates the percentage of cells in which the gene was expressed, and color represents average scaled expression. (C) Weight vector of the 20 genes selected for the axonal phenotype (FLD j 1 ). Legend as in (A). (D) Expression patterns of the 20 genes from (C) in eight types of T4/T5 neurons. Legend as in (B).



Figure 5: Representative plots (at σ = 0.6) of EV and MI metrics for FLDA and other models. (A,B) EV (A) and MI (B) metrics of FLDA. FLD i and FLD j indicate the factorized linear discriminants for features i and j. (C,D) EV (C) and MI (D) metrics of 2LDAs. LD i and LD j indicate the linear discriminant components for features i and j. (E,F) EV (E) and MI (F) metrics of LDA. LD 1 and LD 2 indicate the first two linear discriminant components. (G,H) EV (G) and MI (H) metrics of CCA. CCA 1 and CCA 2 indicate the first two canonical correlation axes. (I,J) EV (I) and MI (J) metrics of PCA. PC 1 and PC 2 indicate the first two principal components. EV i and EV j are the explained variance of features i and j along an axis, and MI i and MI j indicate the mutual information between an axis and features i and j respectively. Values of EV and MI metrics are also indicated by the color bars on the right side.

Figure 7: Additional plots for critical genes extracted from the sparse algorithm. (A) Weight vector of the 20 genes selected for the interaction of both dendritic and axonal features (FLD ij 1 ). The weight value is indicated in the color bar with color indicating direction (red: positive and green: negative) and saturation indicating magnitude. (B) Expression patterns of the 20 genes from (A) in eight types of T4/T5 neurons. Dot size indicates the percentage of cells in which the gene was expressed, and color represents average scaled expression.

Laurenz Wiskott and Terrence J. Sejnowski. Slow Feature Analysis: Unsupervised Learning of Invariances. Neural Computation, 14(4):715-770, April 2002. doi: 10.1162/089976602317318938.

Hongkui Zeng and Joshua R. Sanes. Neuronal cell-type classification: Challenges, opportunities and the path forward. Nature Reviews Neuroscience, 18(9):530-546, September 2017. doi: 10.1038/nrn.2017.85.

Zhifei Zhang, Yang Song, and Hairong Qi. Age Progression/Regression by Conditional Adversarial Autoencoder. arXiv:1702.08423 [cs], March 2017.

Annex

Algorithm 1: pseudo-code for rifle. The loop body follows the truncated Rayleigh flow iteration of Tan et al. (2018), which the rifle algorithm implements.

procedure rifle($N_A$, $M_e$, $u_0$, $l$, $\eta$)      ▷ $\eta$ is the step size
    $t \leftarrow 1$                                   ▷ $t$ indicates the iteration number
    while not converged do                             ▷ converge when $u_t \approx u_{t-1}$
        $\rho \leftarrow (u_{t-1}^T N_A u_{t-1}) / (u_{t-1}^T M_e u_{t-1})$
        $u' \leftarrow u_{t-1} + (\eta/\rho)(N_A - \rho M_e)\, u_{t-1}$
        $u' \leftarrow u' / \lVert u' \rVert_2$
        $u_t \leftarrow \mathrm{truncate}(u', l)$      ▷ keep the $l$ largest-magnitude entries
        $u_t \leftarrow u_t / \lVert u_t \rVert_2$
        $t \leftarrow t + 1$
    return $u_t$

To quantitatively compare FLDA with alternative approaches, we synthesized data of four cell types, each of which contained 25 cells. The four cell types were generated from a Cartesian product of two features $i$ and $j$, where $i \in \{0, 1\}$ and $j \in \{0, 1\}$. Expressions of 10 genes were generated for each cell. The expression value of the $h$th gene in the $k$th cell of the cell type $ij$, $g^h_{ijk}$, was defined as a linear function of $i$, $j$, and $i \vee j$; for example:

$$g^2_{ijk} = j + \epsilon_{ijk} \quad (47)$$
$$g^5_{ijk} = 2i + \epsilon_{ijk} \quad (50)$$
$$g^9_{ijk} = \epsilon_{ijk} \quad (54)$$

where $\epsilon_{ijk} \sim N(0, \sigma^2)$ is Gaussian noise.
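For completeness, here is a compact Python sketch of the rifle iteration (truncated Rayleigh flow; Tan et al., 2018). This is our own illustrative implementation, not the code used in the paper:

```python
import numpy as np

def rifle(N_A, M_e, u0, l, eta=0.01, tol=1e-8, max_iter=10000):
    """Sparse leading generalized eigenvector of (N_A, M_e) via truncated
    Rayleigh flow (rifle, Tan et al., 2018): a gradient step on the
    generalized Rayleigh quotient, then truncation to the l largest entries.
    Illustrative sketch, not the authors' code."""
    u = np.asarray(u0, dtype=float)
    u = u / np.linalg.norm(u)
    for _ in range(max_iter):
        rho = (u @ N_A @ u) / (u @ M_e @ u)          # generalized Rayleigh quotient
        v = u + (eta / rho) * (N_A @ u - rho * (M_e @ u))
        v = v / np.linalg.norm(v)
        keep = np.argsort(np.abs(v))[-l:]            # l largest-magnitude entries
        v_trunc = np.zeros_like(v)
        v_trunc[keep] = v[keep]
        v_trunc = v_trunc / np.linalg.norm(v_trunc)
        if np.linalg.norm(v_trunc - u) < tol:
            return v_trunc
        u = v_trunc
    return u
```

With the scatter matrices from Appendix B as `N_A` and `M_e`, the returned `u` is a sparse discriminant axis with at most `l` non-zero gene weights.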

