FACTORIZED LINEAR DISCRIMINANT ANALYSIS FOR PHENOTYPE-GUIDED REPRESENTATION LEARNING OF NEURONAL GENE EXPRESSION DATA

Abstract

A central goal in neurobiology is to relate the expression of genes to the structural and functional properties of neuronal types, collectively called their phenotypes. Single-cell RNA sequencing can measure the expression of thousands of genes in thousands of neurons, but how can these data be interpreted in the context of neuronal phenotypes? We propose a supervised learning approach that factorizes the gene expression data into components corresponding to individual phenotypic characteristics and their interactions. This new method, which we call factorized linear discriminant analysis (FLDA), seeks a linear transformation of gene expression that varies strongly with only one phenotypic factor and minimally with the others. We further augment our approach with a sparsity-based regularization algorithm, which selects a few genes important to a specific phenotypic feature or feature combination. We applied this approach to a single-cell RNA-Seq dataset of Drosophila T4/T5 neurons, focusing on their dendritic and axonal phenotypes. The analysis confirms results obtained by conventional methods but also points to new genes related to the phenotypes and an intriguing hierarchy in the genetic organization of these cells.

1. INTRODUCTION

The complexity of neural circuits is a result of many different types of neurons that specifically connect to each other. Each neuronal type has its own phenotypic traits, which together determine the role of the neuronal type in a neural circuit. Typical phenotypic descriptions of neurons include features such as dendritic and axonal laminations, electrophysiological properties, and connectivity (Sanes & Masland, 2015; Zeng & Sanes, 2017; Gouwens et al., 2019). However, the genetic programs behind these phenotypic characteristics are still poorly understood. Recent progress in characterizing neuronal cell types and investigating their gene expression, especially with advances in high-throughput single-cell RNA-Seq (Zeng & Sanes, 2017), provides an opportunity to address this challenge. With massive data generated from single-cell RNA-Seq, we now face a computational problem: how can the high-dimensional data be factorized into gene expression modules that are meaningful for neuronal phenotypes? Specifically, given phenotypic descriptions of neuronal types, such as their dendritic stratification and axonal termination, can one project the original data into a low-dimensional space corresponding to these phenotypic features and their interactions, and further extract genes critical to each of these components? Here we propose a new analysis method named factorized linear discriminant analysis (FLDA). Inspired by multi-way analysis of variance (ANOVA) (Fisher, 1918), this method factorizes data into components corresponding to phenotypic features and their interactions, and seeks a linear transformation that varies strongly with one specific factor but not with the others. The linear nature of this approach makes it easy to interpret, as the weight coefficients directly indicate the relative importance of each gene to each factor. We further introduce a sparse variant of the method, which constrains the number of genes contributing to each linear projection.
We illustrate this approach by applying FLDA to a single-cell transcriptome dataset of T4/T5 neurons in Drosophila (Kurmangaliyev et al., 2019), focusing on two phenotypes: dendritic location and axonal lamination (Figure 1).

2. FACTORIZED LINEAR DISCRIMINANT ANALYSIS (FLDA)

Suppose that we are given gene expression data of single neurons, which are typically very high-dimensional. These cells are classified into cell types as a result of clustering in the high-dimensional space and annotations based on prior knowledge or experimental verification (Macosko et al., 2015; Tasic et al., 2016; Shekhar et al., 2016; Tasic et al., 2018; Peng et al., 2019). We know the phenotypic traits of each neuronal type; therefore each type can also be jointly defined by its phenotypic features. We want to find an interpretable low-dimensional embedding in which individual dimensions represent factors of phenotypic features or their interactions. Ideally, variation along one axis of the embedding space would reflect variation in only one factor. In reality, this is hard to satisfy due to noise in the data, so we relax the constraint: data projected along one axis should vary strongly with one factor and minimally with the others. In addition, we ask that cells classified as the same type remain close to each other in the embedding space, while cells of different types are far apart. As a start, let us consider only two phenotypic features of neurons, dendritic stratification and axonal termination, both of which can be described with discrete categories, such as different regions or layers in the brain (Oh et al., 2014; Euler et al., 2014; Sanes & Masland, 2015; Kurmangaliyev et al., 2019). Suppose that each cell type can be jointly represented by its dendritic location, indexed as $i$, and its axonal lamination, indexed as $j$, with $n_{ij}$ cells within each cell type. This representation can be described using a contingency table (Figure 1A, B). Note that we allow the table to be partially filled. Let $x_{ijk}$ ($k = 1, 2, \ldots, n_{ij}$) denote the expression values of $g$ genes in each cell ($x_{ijk} \in \mathbb{R}^g$).
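To make the setup concrete, here is a minimal synthetic sketch of expression data organized by a partial contingency table: each cell carries a $g$-dimensional expression vector $x_{ijk}$ together with its two phenotype labels $i$ and $j$. The array names (`X`, `I`, `J`) and the counts in `n` are illustrative choices, not values from the T4/T5 dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

g = 50        # number of genes
a, b = 3, 4   # levels of feature i (dendrite) and feature j (axon)

# n[i, j]: number of cells of the type with dendritic label i and axonal
# label j; a 0 marks an unobserved combination (partial contingency table).
n = np.array([[20,  0, 15, 18],
              [ 0, 25, 22, 19],
              [17, 21,  0, 23]])

# Flatten the table into per-cell arrays: X holds expression vectors,
# I and J hold the two phenotype labels for each cell.
X, I, J = [], [], []
for i in range(a):
    for j in range(b):
        for _ in range(n[i, j]):
            X.append(rng.normal(size=g))  # stand-in for normalized counts
            I.append(i)
            J.append(j)
X, I, J = np.asarray(X), np.asarray(I), np.asarray(J)
print(X.shape)  # (180, 50): one row per cell, one column per gene
```

The per-cell label arrays `I` and `J` are all that the method needs downstream; no cell-type combination has to be observed for every pair of labels.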
How do we find linear projections $y_{ijk} = u^T x_{ijk}$ and $z_{ijk} = v^T x_{ijk}$ that are aligned with features $i$ and $j$ respectively (Figure 1C)? We first asked whether we could factorize, for example, $y_{ijk}$ into components depending on features $i$ and $j$. Indeed, motivated by the linear factor models used in multi-way ANOVA and the idea of partitioning variance, we constructed the following objective function and found the $u^*$ that maximizes it (see detailed analysis in Appendix A):

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T N_A u}{u^T M_e u}$$

When we have a complete table with $a$ levels for feature $i$ and $b$ levels for feature $j$, we set

$$N_A = M_A - \lambda_1 M_B - \lambda_2 M_{AB}$$

where $M_A$, $M_B$, and $M_{AB}$ are the covariance matrices explained by feature $i$, feature $j$, and their interaction, and $M_e$ is the residual covariance within cell types. $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the relative weights of $M_B$ and $M_{AB}$.
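As a sketch of how this objective can be maximized in practice (our own illustration under the stated assumptions, not the authors' reference implementation): on a complete design, $M_A$, $M_B$, $M_{AB}$, and the within-type residual covariance $M_e$ can be accumulated from group means, and the maximizer of the Rayleigh quotient is the top generalized eigenvector of the pair $(N_A, M_e)$. The function name `flda_axis` and the small ridge added to $M_e$ for numerical stability are our choices.

```python
import numpy as np
from scipy.linalg import eigh

def flda_axis(X, I, J, lam1=1.0, lam2=1.0, ridge=1e-6):
    """Top FLDA direction aligned with feature i (a sketch for complete
    tables; X is cells-by-genes, I and J are the two label arrays)."""
    g = X.shape[1]
    mean = X.mean(axis=0)
    M_A = np.zeros((g, g)); M_B = np.zeros((g, g))
    M_AB = np.zeros((g, g)); M_e = np.zeros((g, g))

    mi = {i: X[I == i].mean(axis=0) for i in np.unique(I)}
    mj = {j: X[J == j].mean(axis=0) for j in np.unique(J)}

    # Between-level covariances for each feature, weighted by cell counts.
    for i, m in mi.items():
        d = m - mean
        M_A += (I == i).sum() * np.outer(d, d)
    for j, m in mj.items():
        d = m - mean
        M_B += (J == j).sum() * np.outer(d, d)

    # Interaction covariance and within-type residual covariance.
    for i in mi:
        for j in mj:
            cell = X[(I == i) & (J == j)]
            if len(cell) == 0:
                continue  # unobserved combination in a partial table
            mij = cell.mean(axis=0)
            d = mij - mi[i] - mj[j] + mean   # ANOVA interaction deviation
            M_AB += len(cell) * np.outer(d, d)
            R = cell - mij
            M_e += R.T @ R

    N_A = M_A - lam1 * M_B - lam2 * M_AB
    # Generalized eigenproblem N_A u = mu M_e u; the eigenvector with the
    # largest eigenvalue maximizes u^T N_A u / u^T M_e u.
    vals, vecs = eigh(N_A, M_e + ridge * np.eye(g))
    return vecs[:, -1]
```

On simulated data where one gene tracks feature $i$ and another tracks feature $j$, the returned axis loads on the feature-$i$ gene: the $\lambda_1 M_B$ and $\lambda_2 M_{AB}$ terms penalize directions that also vary with feature $j$ or the interaction. Swapping the roles of `I` and `J` in the call yields the $v$ axis for feature $j$.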



Figure1: Illustration of our approach. (A,B) In the example, cell types are jointly represented by two phenotypic features, indexed with labels i and j respectively. If only some combinations of the two features are observed, one obtains a partial contingency table (B) instead of a complete one (A). (C) We seek linear projections of the data that separate the cell types in a factorized manner corresponding to the two features. Here u, v, and w are aligned with Feature 1, Feature 2, and the interaction of both features, with the projected coordinates y, z, and s respectively.

