SEMI-SUPERVISED CONSISTENCY REGULARIZATION FOR ACCURATE CELL TYPE FRACTION AND GENE EXPRESSION ESTIMATION

Abstract

Cell deconvolution is the estimation of cell type fractions and cell type-specific gene expression from mixed data with unknown composition. In biomedical research, cell deconvolution, which is a source separation task, is used to obtain mechanistic and diagnostic insights into human diseases. An unmet challenge in cell deconvolution, however, is the scarcity of realistic training data and the strong domain shift observed in the synthetic training data used by contemporary methods. Here, we hypothesize that simultaneous consistency regularization of the target and training domains will improve deconvolution performance. By adding this biologically motivated consistency loss to two novel deep learning-based deconvolution algorithms, we achieve state-of-the-art performance on both cell fraction and gene expression estimation. Our method, DISSECT, outperforms competing algorithms across several gene expression datasets and can be easily adapted to deconvolve other biomedical data types, as exemplified by our spatial expression deconvolution experiments.

1. INTRODUCTION

A prominent approach to studying tissue-specific gene expression changes in human development and disease is bulk RNA sequencing (bulk RNA-seq). Tissues, however, usually consist of multiple cell types in different quantities, with different gene expression programs. As a consequence, bulk RNA-seq from tissues measures average gene expression across the constituent cells, disregarding cell type-specific changes. The quantification of the cellular composition and cell type-specific expression that underlies bulk RNA-seq data is therefore of pivotal importance to understand disease mechanisms and to identify potential therapeutic interventions (Li & Wang, 2021). A recent technological advancement, single-cell RNA-seq, allows gene expression to be measured in thousands of individual cells of a given tissue sample in a single experiment. While it provides unprecedented insights into single-cell biology, it suffers from severe technical limitations, most notably gene expression 'dropouts' (Lähnemann et al., 2020). In addition, the technology is still very costly, which largely prohibits its application in clinical and diagnostic settings. Bulk RNA-seq, on the other hand, can be performed for a fraction of the cost and is widely used in clinical oncology and drug discovery (Zhou et al., 2019; Roberts et al., 2014). To computationally infer cell fraction and cell type-specific gene expression information from bulk RNA-seq data, recent computational methods utilize single-cell sequencing data to create simulated references with known fraction and expression information for training (Avila Cobos et al., 2020; Menden et al., 2020; Newman et al., 2019; Wang et al., 2019). While this approach achieves good deconvolution results, its performance suffers from the strong domain gap between the single-cell RNA-seq training (reference) data and the bulk RNA-seq target data.
Among the many possible sources of variation, the two most obvious are batch effects, i.e., technological differences between two sequencing experiments, and gene expression differences of a biological nature. In the next section, we formally define the task of cell deconvolution and present our hypothesis that semi-supervised consistency regularization minimizes the bulk RNA-seq deconvolution error while learning from single-cell RNA-seq data.

Given an $m \times n$ gene expression matrix $B$ consisting of $m$ bulk gene expression vectors measuring $n$ genes, the goal of deconvolution is to find an $m \times c$ matrix $X$ of cell type fractions, where $c$ is the number of cell types present in the bulk samples, such that

$$B = XS,$$

where fractions and gene expression satisfy non-negativity,

$$0 \le X_{ik}, \quad 0 \le S_{kj}, \quad \forall i \in [1, m],\ \forall j \in [1, n],\ \forall k \in [1, c],$$

and the sum-to-one criterion, i.e.,

$$\sum_{k=1}^{c} X_{ik} = 1, \quad \forall i \in [1, m].$$

Here, $S$ is known as the signature matrix and is unobserved. Each row $S_{k\bullet}$ is a gene expression profile (or signature) of cell type $k$. To utilize a reference-based framework, $S$ can be replaced with $S^{ref}$ derived from a single-cell experiment.

The problem of reference-based cell deconvolution can alternatively be formulated as a learning problem, in which a function $f$ such that $f(B) = X$ is learned. Since only $B$ is available and $X$ is generally unknown, simulations from a single-cell reference can be used to learn $f$. The above formulation of the cell deconvolution task makes it reasonable to assume linearity of deconvolution, i.e., each bulk mixture is a linear combination of the expression vectors of cells, weighted by the corresponding cell type fractions. Thus, as defined in (Menden et al., 2020), multiple single cells can be combined in random proportions to generate training examples $B^{sim}$ and $X^{sim}$, where each row of $B^{sim}$ is defined as

$$B^{sim}_{i\bullet} = \sum_{k=1}^{c} \sum_{l=1}^{\alpha_{k,i}} e^{k}_{l},$$

where $e^{k}_{l}$ is the expression vector of cell $l$ belonging to cell type $k$, and $\alpha_{k,i}$ is the number of cells of cell type $k$ sampled to construct $B^{sim}_{i\bullet}$. Correspondingly, each element of $X^{sim}$ is the proportion of cell type $k$ in sample $i$ and is defined as

$$X^{sim}_{ik} = \frac{\alpha_{k,i}}{\sum_{k'=1}^{c} \alpha_{k',i}}.$$

In this case, since each simulated sample has a distinct signature (i.e., gene expression profile), $S$ is a three-dimensional array in which each element $S_{kji}$ denotes the expression of gene $j$ in cell type $k$ for sample $i$. It is computed as follows:

$$S^{sim}_{k\bullet i} = \frac{\sum_{l=1}^{\alpha_{k,i}} e^{k}_{l}}{\alpha_{k,i}}.$$

The predictor $f$, learned from a simulated dataset, can then be applied to $B$ to estimate $X$. Note that the genes expressed may differ between the vectors $e^{k}_{l}$ and $B$; therefore, before learning the function $f$, each $e^{k}_{l}$ is subsetted to the genes it has in common with $B$. This is why the learning problem is transductive and a separate model needs to be constructed for each $B$.

2.1.1. ASSUMPTION

From Section 2, it is evident that the relationship between $B$ and $S$ is linear. However, $S$ is unobserved and learning is done using simulations. To address the inherent domain shift, we hypothesize that a consistency-based regularization penalizing non-linearity of mixtures of real and simulated samples would result in a learned mapping $f$ that is closer to the true $f$. We define this regularization in Section 2.1.2.
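
As an illustration of the simulation scheme of Section 2, the following is a minimal NumPy sketch that builds B^sim, X^sim and S^sim from single-cell profiles. It is not the implementation used in this work: the function name `simulate_bulk`, the flat Dirichlet and multinomial sampling of the cell counts α_{k,i}, sampling cells with replacement, and the toy Poisson data are all assumptions made only for the example.

```python
import numpy as np


def simulate_bulk(expr, cell_types, n_types, n_samples, cells_per_sample, rng):
    """Simulate pseudo-bulk samples from single-cell profiles (hypothetical helper).

    expr:       (n_cells, n_genes) single-cell expression; rows play the role of e^k_l
    cell_types: (n_cells,) integer label k in [0, n_types) for each cell
    Returns B_sim (n_samples, n_genes), X_sim (n_samples, n_types)
    and S_sim (n_types, n_genes, n_samples).
    """
    n_genes = expr.shape[1]
    B_sim = np.zeros((n_samples, n_genes))
    X_sim = np.zeros((n_samples, n_types))
    S_sim = np.zeros((n_types, n_genes, n_samples))
    for i in range(n_samples):
        # Assumption: random proportions from a flat Dirichlet, then
        # integer cell counts alpha[k] = alpha_{k,i} from a multinomial.
        props = rng.dirichlet(np.ones(n_types))
        alpha = rng.multinomial(cells_per_sample, props)
        for k in range(n_types):
            if alpha[k] == 0:
                continue  # cell type absent: S_sim[k, :, i] stays zero
            pool = np.flatnonzero(cell_types == k)
            # Assumption: cells are drawn with replacement.
            picked = rng.choice(pool, size=alpha[k], replace=True)
            summed = expr[picked].sum(axis=0)
            B_sim[i] += summed               # inner sum over l, outer sum over k
            S_sim[k, :, i] = summed / alpha[k]  # per-sample signature of type k
        X_sim[i] = alpha / alpha.sum()       # cell type fractions of sample i
    return B_sim, X_sim, S_sim


# Toy single-cell data (purely illustrative): 3 cell types, 100 cells each.
rng = np.random.default_rng(0)
n_cells, n_genes, n_types = 300, 20, 3
cell_types = np.repeat(np.arange(n_types), n_cells // n_types)
expr = rng.poisson(lam=5.0, size=(n_cells, n_genes)).astype(float)

B_sim, X_sim, S_sim = simulate_bulk(
    expr, cell_types, n_types, n_samples=8, cells_per_sample=100, rng=rng
)

# Linearity check: B_sim_i / (cells per sample) = sum_k X_sim_ik * S_sim_{k, :, i}
recon = np.einsum("ik,kji->ij", X_sim, S_sim)
assert np.allclose(B_sim / 100, recon)
```

The final check shows that, after dividing each simulated bulk vector by the number of sampled cells, the sample equals the fraction-weighted sum of its per-sample signatures, which is the linear relationship the formulation relies on.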

