FACTORIZED LINEAR DISCRIMINANT ANALYSIS FOR PHENOTYPE-GUIDED REPRESENTATION LEARNING OF NEURONAL GENE EXPRESSION DATA

Abstract

A central goal in neurobiology is to relate the expression of genes to the structural and functional properties of neuronal types, collectively called their phenotypes. Single-cell RNA sequencing can measure the expression of thousands of genes in thousands of neurons. How should these data be interpreted in the context of neuronal phenotypes? We propose a supervised learning approach that factorizes the gene expression data into components corresponding to individual phenotypic characteristics and their interactions. This new method, which we call factorized linear discriminant analysis (FLDA), seeks a linear transformation of gene expression that varies highly with only one phenotypic factor and minimally with the others. We further extend the approach with a sparsity-based regularization algorithm, which selects a few genes important to a specific phenotypic feature or feature combination. We applied this approach to a single-cell RNA-Seq dataset of Drosophila T4/T5 neurons, focusing on their dendritic and axonal phenotypes. The analysis confirms results obtained by conventional methods, and also points to new genes related to the phenotypes and to an intriguing hierarchy in the genetic organization of these cells.

1. INTRODUCTION

The complexity of neural circuits arises from many different types of neurons that connect specifically to each other. Each neuronal type has its own phenotypic traits, which together determine the role of that type in a neural circuit. Typical phenotypic descriptions of neurons include features such as dendritic and axonal laminations, electrophysiological properties, and connectivity (Sanes & Masland, 2015; Zeng & Sanes, 2017; Gouwens et al., 2019). However, the genetic programs behind these phenotypic characteristics are still poorly understood.

Recent progress in characterizing neuronal cell types and investigating their gene expression, especially with advances in high-throughput single-cell RNA-Seq (Zeng & Sanes, 2017), provides an opportunity to address this challenge. With massive data generated from single-cell RNA-Seq, we now face a computational problem: how can the high-dimensional data be factorized into gene expression modules that are meaningful to neuronal phenotypes? Specifically, given phenotypic descriptions of neuronal types, such as their dendritic stratification and axonal termination, can one project the original data into a low-dimensional space corresponding to these phenotypic features and their interactions, and further extract genes critical to each of these components?

Here we propose a new analysis method named factorized linear discriminant analysis (FLDA). Inspired by multi-way analysis of variance (ANOVA) (Fisher, 1918), this method factorizes data into components corresponding to phenotypic features and their interactions, and seeks a linear transformation that varies highly with one specific factor but not with the others. The linear nature of this approach makes it easy to interpret, as the weight coefficients directly indicate the relative importance of each gene to each factor. We further introduce a sparse variant of the method, which constrains the number of genes contributing to each linear projection.
We illustrate this approach by applying FLDA to a single-cell transcriptome dataset of T4/T5 neurons in Drosophila (Kurmangaliyev et al., 2019), focusing on two phenotypes: dendritic location and axonal lamination.
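
To make the ANOVA-style factorization mentioned above more concrete, the following minimal sketch splits a balanced table of cell-type means into a grand mean, two main effects, and an interaction term. It is not FLDA itself, only the two-way decomposition that motivates it. The function name `two_way_decomposition` and the toy numbers are assumptions made for illustration; the two factors are labeled after the phenotypes considered here (dendritic location and axonal lamination).

```python
from statistics import mean


def two_way_decomposition(cell_means):
    """Split a balanced table of cell means into ANOVA-style components.

    cell_means[i][j] is the mean value of some 1-D projection of gene
    expression for level i of factor A and level j of factor B.
    Returns (grand mean, effects of A, effects of B, interaction table).
    """
    n_a = len(cell_means)
    n_b = len(cell_means[0])
    # Grand mean over all cells of the (balanced) table.
    grand = mean(mean(row) for row in cell_means)
    # Main effect of A: row mean minus grand mean.
    effect_a = [mean(row) - grand for row in cell_means]
    # Main effect of B: column mean minus grand mean.
    effect_b = [
        mean(cell_means[i][j] for i in range(n_a)) - grand
        for j in range(n_b)
    ]
    # Interaction: what remains after removing grand mean and main effects.
    interaction = [
        [cell_means[i][j] - grand - effect_a[i] - effect_b[j]
         for j in range(n_b)]
        for i in range(n_a)
    ]
    return grand, effect_a, effect_b, interaction


# Hypothetical toy values (not data from the paper):
# rows = dendritic location (factor A), columns = axonal lamination (factor B).
toy = [[1.0, 1.2],
       [3.0, 3.2]]
grand, a, b, ab = two_way_decomposition(toy)
print("grand mean:", grand)
print("effect of A:", a)
print("effect of B:", b)
print("interaction:", ab)
```

In this toy table the projection changes strongly with factor A and only weakly with factor B, and the interaction term is essentially zero. This is the kind of profile one would hope for from a projection aligned with a single phenotypic factor.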

