SEMI-SUPERVISED CONSISTENCY REGULARIZATION FOR ACCURATE CELL TYPE FRACTION AND GENE EXPRESSION ESTIMATION

Abstract

Cell deconvolution is the estimation of cell type fractions and cell type-specific gene expression from mixed data with unknown composition. In biomedical research, cell deconvolution, which is a source separation task, is used to obtain mechanistic and diagnostic insights into human diseases. An unmet challenge in cell deconvolution, however, is the scarcity of realistic training data and the strong domain shift observed in the synthetic training data used by contemporary methods. Here, we hypothesize that simultaneous consistency regularization of the target and training domains will improve deconvolution performance. By adding this biologically motivated consistency loss to two novel deep learning-based deconvolution algorithms, we achieve state-of-the-art performance on both cell fraction and gene expression estimation. Our method, DISSECT, outperforms competing algorithms across several gene expression datasets and can be easily adapted to deconvolve other biomedical data types, as exemplified by our spatial expression deconvolution experiments.

1. INTRODUCTION

A prominent approach to study tissue-specific gene expression changes in human development and disease is bulk RNA sequencing (bulk RNA-seq). Tissues, however, usually consist of multiple cell types in different quantities, with different gene expression programs. As a consequence, bulk RNA-seq from tissues measures average gene expression across the constituent cells, disregarding cell type-specific changes. The quantification of the cellular composition and cell type-specific expression that underlies bulk RNA-seq data is therefore of pivotal importance to understand disease mechanisms and to identify potential therapeutic interventions (Li & Wang, 2021). A recent technological advancement, single-cell RNA-seq, allows for the investigation of gene expression in thousands of individual cells of a given tissue sample in a single experiment. While it provides unprecedented insights into single-cell biology, it suffers from severe technical limitations, most notably gene expression 'dropouts' (Lähnemann et al., 2020). In addition, the technology is still very costly, which largely prohibits its application in clinical and diagnostic settings. Bulk RNA-seq, on the other hand, can be performed for a fraction of the cost and is widely used in clinical oncology and drug discovery (Zhou et al., 2019; Roberts et al., 2014). To computationally infer cell fraction and cell type-specific gene expression information from bulk RNA-seq data, recent computational methods utilize single-cell sequencing data to create simulated references with known fraction and expression information for training (Avila Cobos et al., 2020; Menden et al., 2020; Newman et al., 2019; Wang et al., 2019). While this approach achieves good deconvolution results, its performance suffers from the strong domain gap between the single-cell RNA-seq training (reference) data and the bulk RNA-seq target data.
Among many possible sources of variation, the two most obvious are batch effects, which refer to technological differences between two sequencing experiments, and gene expression differences of a biological nature. In the next section we formally define the task of cell deconvolution, and present our hypothesis that semi-supervised consistency regularization minimizes the bulk RNA-seq deconvolution error while learning from single-cell RNA-seq data.

2. CELL DECONVOLUTION

Given an $m \times n$ gene expression matrix $B$ consisting of $m$ bulk gene expression vectors measuring $n$ genes, the goal of deconvolution is to find an $m \times c$ matrix $X$ of cell type fractions, where $c$ is the number of cell types present in the bulk samples, such that

$$B = XS,$$

where fractions and gene expression satisfy non-negativity, $0 \le X_{ik}$ and $0 \le S_{kj}$, $\forall i \in [1, m]$, $\forall j \in [1, n]$ and $\forall k \in [1, c]$, and the sum-to-1 criterion, i.e. $\sum_{k=1}^{c} X_{ik} = 1$, $\forall i \in [1, m]$. Here, $S$ is known as the signature matrix and is unobserved. Each row $S_{k\bullet}$ is a gene expression profile (or signature) of cell type $k$. To utilize a reference-based framework, $S$ can be replaced with $S^{ref}$ derived from a single-cell experiment.

The problem of reference-based cell deconvolution can alternatively be formulated as a learning problem, where a function $f$ such that $f(B) = X$ is learnt. Since only $B$ is available and $X$ is generally unknown, simulations from a single-cell reference can be used to learn $f$. From the above formulation of the cell deconvolution task, it is reasonable to assume linearity of deconvolution, i.e., each bulk mixture is a linear combination of the expression vectors of cells, weighted by the corresponding cell type fractions. Thus, as defined in Menden et al. (2020), multiple single cells can be combined in random proportions to generate training examples $B^{sim}$ and $X^{sim}$, where each row of $B^{sim}$ is defined as

$$B^{sim}_{i\bullet} = \sum_{k=1}^{c} \sum_{l=1}^{\alpha_{k,i}} e^{k}_{l},$$

where $e^{k}_{l}$ is the expression vector of cell $l$ belonging to cell type $k$, and $\alpha_{k,i}$ is the number of cells belonging to cell type $k$ sampled to construct $B^{sim}_{i\bullet}$. Correspondingly, each element of $X^{sim}$ is the proportion of cell type $k$ in sample $i$ and is defined as

$$X^{sim}_{ik} = \frac{\alpha_{k,i}}{\sum_{k=1}^{c} \alpha_{k,i}}.$$

In this case, since each simulated sample has a distinct signature (i.e. gene expression profile), $S$ is a three-dimensional array with each element $S_{kji}$ denoting the expression of gene $j$ in cell type $k$ for sample $i$. It is computed as follows:

$$S^{sim}_{k \bullet i} = \frac{\sum_{l=1}^{\alpha_{k,i}} e^{k}_{l}}{\alpha_{k,i}}.$$

The predictor $f$, learned from a simulated dataset, can then be applied to $B$ to estimate $X$. Note that the genes expressed may differ between the vectors $e^{k}_{l}$ and $B$, and as such, before learning the function $f$, each $e^{k}_{l}$ is subsetted to include only the genes in common with $B$. This is why the learning problem is transductive and a separate model needs to be constructed for each $B$.
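The simulation procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the reference implementation: the function name `simulate_bulk`, the Dirichlet draw used to obtain random proportions, the default of 100 cells per sample, and sampling cells with replacement are all assumptions made for the example.

```python
import numpy as np

def simulate_bulk(expr, labels, n_types, cells_per_sample=100, rng=None):
    """Build one pseudo-bulk sample from a (cells x genes) matrix `expr`.

    `labels` holds an integer cell type in [0, n_types) for every cell.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: random proportions are drawn from a flat Dirichlet and
    # converted to integer cell counts alpha_k that sum to cells_per_sample.
    props = rng.dirichlet(np.ones(n_types))
    alpha = rng.multinomial(cells_per_sample, props)

    n_genes = expr.shape[1]
    bulk = np.zeros(n_genes)
    signatures = np.zeros((n_types, n_genes))
    for k in range(n_types):
        if alpha[k] == 0:
            continue  # cell type absent from this sample
        pool = np.flatnonzero(labels == k)
        picked = rng.choice(pool, size=alpha[k], replace=True)
        summed = expr[picked].sum(axis=0)
        bulk += summed                     # B_sim: sum over sampled cells
        signatures[k] = summed / alpha[k]  # S_sim: mean profile of sampled cells
    fractions = alpha / alpha.sum()        # X_sim: alpha_k / sum_k alpha_k
    return bulk, fractions, signatures
```

By construction, the simulated bulk equals the sum over cell types of the cell count times the per-sample signature, which is the linear relationship the deconvolution task assumes.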

2.1.1. ASSUMPTION

From Section 2, it is evident that the relationship between B and S is linear. However, S is unobserved and learning is done using simulations. To address the inherent domain shift, we hypothesize that a consistency-based regularization that penalizes non-linearity of mixtures of real and simulated samples would result in a learned mapping f that is closer to the true f. We define this regularization in Section 2.1.2.

2.1.2. CONSISTENCY REGULARIZATION

Let $B$ denote the gene expression matrix of the real (i.e. test) bulk RNA-seq samples that we want to deconvolve, and $B^{sim}$ the gene expression matrix of simulated bulk samples. The number of rows (representing samples) in these two matrices may differ. To simplify the notation, we use the same index $i$ for real bulk samples, simulations (sim) and their mixtures (mix, defined below). Given a true bulk RNA-seq sample $B_{i\bullet}$ and a simulated sample with paired proportions $(B^{sim}_{i\bullet}, X^{sim}_{i\bullet})$ defined over a common gene set, we can generate a mixture $B^{mix}_{i\bullet}$ such that

$$B^{mix}_{i\bullet} = \beta B_{i\bullet} + (1 - \beta) B^{sim}_{i\bullet}, \tag{2}$$

which gives us the relation

$$X^{mix}_{i\bullet} S^{mix}_{\bullet i \bullet} = \beta X_{i\bullet} S_{\bullet i \bullet} + (1 - \beta) X^{sim}_{i\bullet} S^{sim}_{\bullet i \bullet}. \tag{3}$$

Cell types are characterized by a few marker genes that are invariant across cell states and even across tissues (Domínguez Conde et al., 2022). A network that accurately predicts cell type fractions based on the gene expression of simulated (or real) bulks would thus have to learn them. Therefore, to estimate cell type fractions, we assume that the expression of these marker genes is identical in the signatures $S^{mix}_{\bullet i \bullet}$, $S_{\bullet i \bullet}$ and $S^{sim}_{\bullet i \bullet}$. Hence,

$$X^{mix}_{i\bullet} = \beta X_{i\bullet} + (1 - \beta) X^{sim}_{i\bullet}, \tag{4}$$

where $\beta \in [0, 1]$. Equation 4 enables the use of consistency regularization without having to explicitly estimate signatures. In an iterative learning process, $X_{i\bullet}$ can be replaced with the predictions of the algorithm from the previous iteration. Naturally, it is also possible to only mix real samples with each other. However, the number of samples available from true bulk RNA-seq (usually ranging from a couple to less than a thousand) is considerably smaller than the number of single cells present in a single-cell experiment (usually in the thousands). Equation 4 allows us to generate pseudo ground truth proportions for the mixtures $B^{mix}_{i\bullet}$ at each step of learning cell type fractions, while Equation 3 allows us to generate pseudo ground truth signatures at each step of learning gene expression profiles.

We define the network architecture and loss functions in Section 2.2.
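As an illustration of how a mixture and its pseudo ground truth could be formed for one pair of samples, consider the following minimal sketch. The helper name `make_mixture` and its argument names are hypothetical. The default interval for β follows the [0.1, 0.9] range stated in Section 2.2.1, and `x_real_pred` stands in for the network's current prediction on the real sample, which replaces the unknown true fractions.

```python
import numpy as np

def make_mixture(b_real, b_sim, x_sim, x_real_pred, rng, low=0.1, high=0.9):
    """Mix one real and one simulated bulk sample over a common gene set."""
    beta = rng.uniform(low, high)
    # Mixture of expression vectors (the relation labelled Equation 2).
    b_mix = beta * b_real + (1.0 - beta) * b_sim
    # Pseudo ground truth fractions (the relation labelled Equation 4),
    # using the model's current prediction in place of the unknown X.
    x_mix = beta * x_real_pred + (1.0 - beta) * x_sim
    return b_mix, x_mix, beta
```

Because both inputs to the fraction mixture are non-negative and sum to one, the pseudo ground truth automatically satisfies the same constraints.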

2.2. NETWORK ARCHITECTURE AND LEARNING PARADIGM

We treat the estimation of cell type fractions and the estimation of gene expression profiles per cell type as two separate tasks because of their differing assumptions. For the estimation of cell type fractions, we assume that signatures are identical for each sample, both simulated and bulk, while to estimate gene expression, we relax this condition and use the full consistency regularization (Equation 3).

2.2.1. ESTIMATION OF CELL TYPE FRACTIONS

The underlying algorithm of the first part of our deconvolution method is an average ensemble of multilayer perceptrons (MLPs). Each MLP has the same architecture but is initialized with different weights. This is done to reduce the variance by averaging different runs (Ju et al., 2018). Each MLP has the architecture: Input (# genes) - ReLU6 (512) - ReLU6 (256) - ReLU6 (128) - ReLU6 (64) - Linear (# cell types) - Softmax. ReLU6 (the output of the ReLU activation clipped at a maximum value of 6) (Hannun et al., 2014; Sandler et al., 2018) was chosen by a grid search over the activations [Linear, ReLU, ReLU6, Swish (Ramachandran et al., 2017)]. The final Softmax activation enforces the non-negativity and sum-to-1 criteria of deconvolution. We train the network with a batch size of 64 to minimize the loss function defined below, using the Adam optimizer with an initial learning rate of 1e-5:

$$L_{total}(X^{sim}_{i\bullet}, f(B^{sim}_{i\bullet}), X^{mix}_{i\bullet}, f(B^{mix}_{i\bullet})) = L_{KLdivergence}(X^{sim}_{i\bullet}, f(B^{sim}_{i\bullet})) + \lambda_1 L_{cons}(X^{mix}_{i\bullet}, f(B^{mix}_{i\bullet})),$$

where $L_{KLdivergence}(\cdot, \cdot)$ is the Kullback-Leibler divergence and $L_{cons}(\cdot, \cdot)$ is the consistency loss defined as

$$L_{cons}(X^{mix}_{i\bullet}, f(B^{mix}_{i\bullet})) = \| X^{mix}_{i\bullet} - f(B^{mix}_{i\bullet}) \|_2^2, \qquad X^{mix}_{i\bullet} = \beta f(B_{i\bullet}) + (1 - \beta) X^{sim}_{i\bullet}.$$

To generate mixtures, for each batch, we sample β uniformly at random for Equation 4. The interval [0.1, 0.9] was chosen for the uniform distribution so that each mixture contains at least some real and some simulated gene expression. This loss is similar to the semi-supervised framework proposed in MixMatch (Berthelot et al., 2019). MixMatch uses unlabelled samples to MixUp and match sample predictions and generalizes the semi-supervised framework, while the loss defined in Equation 5 addresses the limited number of samples available from true bulk RNA-seq and the unavailability of sample fractions, and is derived from the definition of the task itself. In essence, Equation 5 integrates domain knowledge into the objective.
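The objective above can be written compactly for a batch of predictions. The sketch below is a NumPy illustration under stated assumptions rather than the actual training code: the function names, the clipping constant `eps`, and the mean reduction over the batch are choices made for the example, and in a real training loop the pseudo ground truth would typically be treated as a constant when computing gradients.

```python
import numpy as np

def kl_divergence(p_true, p_pred, eps=1e-8):
    """Row-wise KL(p_true || p_pred); clipping with eps is an assumption."""
    p_true = np.clip(p_true, eps, 1.0)
    p_pred = np.clip(p_pred, eps, 1.0)
    return np.sum(p_true * np.log(p_true / p_pred), axis=-1)

def total_loss(x_sim, pred_sim, pred_real, pred_mix, beta, lam):
    """KL term on simulated samples plus lam times the consistency term.

    All arrays have shape (batch, n_cell_types); pred_* are network outputs
    f(B_sim), f(B) and f(B_mix) respectively.
    """
    # Pseudo ground truth for the mixtures.
    x_mix = beta * pred_real + (1.0 - beta) * x_sim
    l_cons = np.sum((x_mix - pred_mix) ** 2, axis=-1)
    return float(np.mean(kl_divergence(x_sim, pred_sim) + lam * l_cons))
```

The loss is zero when the network reproduces the simulated fractions exactly and its prediction on each mixture equals the β-weighted combination of its prediction on the real sample and the simulated fractions.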
To avoid a scenario where the network does not learn and outputs predictions such that $f(B^{mix}_{i\bullet}) = f(B^{sim}_{i\bullet}) = f(B_{i\bullet})$, which is a solution to Equation 4, we first let the model learn purely from simulated examples. This allows the model to learn meaningful expression profiles and achieve accurate results on simulated examples. We selected $\lambda_1$ based on a grid search over constant and step-wise functions. We adopt a step-wise function for $\lambda_1$, given as

$$\lambda_1 = \begin{cases} 0 & \text{if } step \le 2000, \\ 15 & \text{if } 2000 < step \le 4000, \\ 10 & \text{otherwise.} \end{cases}$$

We train the network for a predefined number of steps as opposed to epochs, since it is possible to generate infinitely many simulated samples without increasing the intrinsic dimensionality of the data. In our experiments, we set the number of steps to 5000, as found optimal in Menden et al. (2020).
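The step-wise schedule is simple enough to state directly in code. The function name `lambda_1` is hypothetical, and the boundary at step 2000 is resolved by taking the first matching condition, so that step 2000 still belongs to the simulation-only phase.

```python
def lambda_1(step):
    """Weight of the consistency loss as a function of the training step."""
    if step <= 2000:
        return 0.0   # learn from simulated examples only
    if step <= 4000:
        return 15.0  # strong consistency regularization
    return 10.0      # reduced weight for the remaining steps
```

With 5000 training steps in total, this corresponds to 2000 steps without regularization, 2000 steps with weight 15, and 1000 steps with weight 10.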

2.2.2. ESTIMATION OF PER SAMPLE CELL TYPE SPECIFIC GENE EXPRESSION PROFILES

Estimation of cell type fractions from bulk RNA-seq requires the assumption that the signatures of cell types are shared across single-cell and bulk RNA-seq. However, cell type gene expression profiles (at least for genes that are not invariant across tissue states) may differ between samples. Previous works such as CSx (Newman et al., 2019) and TAPE (Chen et al., 2021) have explored utilizing cell type fractions to estimate gene expression per sample. Here, we make use of a β-variational autoencoder with a standard normal distribution as prior to estimate the average gene expression of the different cell types from bulk RNA-seq expression levels. To jointly train the network on all cell types, we condition the decoder (at its input layer) on cell type labels. This allows for training a single model to estimate the gene expression of each cell type for a sample. To make use of bulk RNA-seq during training, we regularize the reconstruction loss with a consistency loss defined over the per cell type signature. Denoting $f$ as before and $g(\cdot, k)$ as the output of the autoencoder with condition $k$ (corresponding to the cell type label) on the decoder input, this consistency loss is defined as

$$L^{VAE}_{cons}(f, g, B^{mix}_{i\bullet}, B_{i\bullet}, X^{sim}_{i\bullet}, S^{sim}_{ki\bullet}) = \left\| f(B^{mix}_{i\bullet})_k \, g(B^{mix}_{i\bullet}, k) - \left( \beta f(B_{i\bullet})_k \, g(B_{i\bullet}, k) + (1 - \beta) X^{sim}_{ik} S^{sim}_{ki\bullet} \right) \right\|_2^2,$$

where $B^{mix}_{i\bullet}$ is given by Equation 2 and $f(B^{mix}_{i\bullet})_k$ is the estimated proportion of cell type $k$ in sample $i$. In the implementation, we replace $f(B^{mix}_{i\bullet})_k$ with $\beta f(B_{i\bullet})_k + (1 - \beta) X^{sim}_{ik}$. Thus, this loss forces the learned signature for cell type $k$, $g(B^{mix}_{i\bullet}, k)$, to be closer to the signatures of both the real and the simulated bulk samples. This loss function makes the assumption that mixing two bulk samples is similar to mixing the individual cell type-specific signatures that constitute those bulks.
We added this loss function with a regularization parameter $\lambda_2$ (with default value 0.1) to the loss of the standard β-variational autoencoder (the weight on the KL divergence, denoted as $\beta_{VAE}$, is set to 0.1 by default). The total loss function is

$$L^{VAE}_{total}(f, g, B^{sim}_{i\bullet}, B^{mix}_{i\bullet}, B_{i\bullet}, X^{sim}_{i\bullet}, S^{sim}_{ki\bullet}) = \| S^{sim}_{ki\bullet} - g(B^{sim}_{i\bullet}, k) \|_2^2 + \lambda_2 L^{VAE}_{cons}(f, g, B^{mix}_{i\bullet}, B_{i\bullet}, X^{sim}_{i\bullet}, S^{sim}_{ki\bullet}) + \beta_{VAE} L_{KLdivergence}(N(\mu, \sigma), N(0, 1)),$$

where $N(0, 1)$ is the standard normal distribution, and $\mu$ and $\sigma$ are the empirical mean and standard deviation estimated from the output of the encoder. Both the encoder and the decoder consist of two hidden layers. We train the network to minimize the loss function using the Adam optimizer with an initial learning rate of 1e-4. By default, the network is trained with a batch size of 32 for 10000 × c steps. The architecture of the network is summarized in Figure 1.
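To make the structure of this objective concrete, the following NumPy sketch evaluates the loss terms for a single sample and cell type. It is an illustration under assumptions, not the actual implementation: all function and argument names are hypothetical, the signature vectors are passed in directly rather than produced by a trained autoencoder, and the consistency term is read as the squared distance between the mixture term and the β-weighted combination of the real and simulated terms, in line with Equation 3. The KL term uses the standard closed form for a diagonal Gaussian against a standard normal.

```python
import numpy as np

def vae_consistency_loss(frac_real_k, sig_real_k, x_sim_k, sig_sim_k, sig_mix_k, beta):
    """Consistency term for one sample and one cell type k.

    frac_real_k: f(B)_k, predicted fraction of type k in the real sample.
    sig_real_k, sig_mix_k: g(B, k) and g(B_mix, k), predicted signatures.
    x_sim_k, sig_sim_k: simulated fraction and signature of type k.
    """
    # Fraction of type k in the mixture, as substituted in the implementation.
    frac_mix_k = beta * frac_real_k + (1.0 - beta) * x_sim_k
    target = beta * frac_real_k * sig_real_k + (1.0 - beta) * x_sim_k * sig_sim_k
    return float(np.sum((frac_mix_k * sig_mix_k - target) ** 2))

def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma) || N(0, 1)) summed over latent dimensions."""
    return float(0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma)))

def vae_total_loss(sig_sim_k, recon_sim_k, cons, mu, sigma, lam2=0.1, beta_vae=0.1):
    """Reconstruction + lam2 * consistency + beta_vae * KL."""
    recon = float(np.sum((sig_sim_k - recon_sim_k) ** 2))
    return recon + lam2 * cons + beta_vae * kl_to_standard_normal(mu, sigma)
```

When the three signatures agree, the consistency term vanishes, which reflects the assumption that mixing two bulks is similar to mixing their cell type-specific signatures.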

3. RELATED WORK AND COMPETING ALGORITHMS

Several methods for cell deconvolution have been developed. Avila Cobos et al. (2020) and Jin & Liu (2021) provided a benchmark and review of state-of-the-art cell deconvolution algorithms. Here, we focus on MuSiC, CSx, Scaden and TAPE, although there are several additional methods such as DWLS (Tsoucas et al., 2019) and Bisque (Jew et al., 2020). MuSiC and CSx are single-cell reference-based methods that have performed well on simulation studies in the aforementioned benchmarking studies. Scaden and TAPE are selected because both are deep learning-based deconvolution approaches. Of these methods, CSx and TAPE can also estimate per sample cell type-specific gene expression signatures. We briefly detail these approaches below.

MuSiC (Wang et al., 2019) uses weighted non-negative least squares. MuSiC maintains cross-cell and cross-sample consistencies by appropriately weighting genes based on their informativity during an iterative procedure. MuSiC is provided as an R package. Deconvolution using MuSiC was performed according to the authors' recommendations. Since MuSiC is a method that utilizes multi-subject scRNA-seq datasets, we used cells from multiple subjects in deconvolution with MuSiC when available.

CibersortX (CSx) (Newman et al., 2019) is a deconvolution method that addresses the domain gap between scRNA-seq and bulk samples by aiming to correct batch effects. It uses scRNA-seq to generate a cell type-specific signature matrix and uses ν-support vector regression as the underlying algorithm. CSx comprises two modes, S-mode and B-mode, to address the domain gap. S-mode is used when deconvolving with a signature matrix constructed using an scRNA-seq dataset, while B-mode is used when deconvolving with a signature matrix constructed using purified samples. We followed the documentation provided by the authors to run CSx and used the S-mode. CSx can also predict gene expression signatures for each sample, for which it uses a non-negative matrix factorization-based iterative algorithm.

Scaden (Menden et al., 2020) is an average ensemble of three deep neural networks with different architectures that was developed for cell fraction deconvolution. Each network is trained only on simulated pseudo-bulk data generated from an scRNA-seq reference, similar to the procedure described in Section 2. Scaden is provided as a Python package. We used the official Scaden package with the instructions provided by the authors to train the networks.

TAPE (Chen et al., 2021) is a fully-connected autoencoder where the bottleneck consists of cell type fractions. The architecture of the encoder is similar to the architecture of Scaden but with CeLU activations. The decoder consists of linear activations and outputs the gene expression of the input vector. The adaptive mode of TAPE aims at optimizing the network for bulk samples, while the overall mode trains for fractions with an added loss function that reconstructs the input bulk expression from fractions. Since network weights change between the two modes, we compare with both and refer to TAPE in overall mode as TAPE-O and in adaptive mode as TAPE-A. Since TAPE-A reconstructs gene expression from fractions (the bottleneck), the signature matrix is visible in the (linear) decoder. To estimate gene expression signatures for each bulk sample, decoder weights are optimized per sample using an iterative optimization strategy.

Linear MLPs: The solution to the deconvolution problem could, in principle, be a linear function. For this reason we also compared to an MLP ensemble that is based on the architecture in Section 2.2, but in which we replaced all non-linear activations with an identity function and removed the consistency loss.

4.1. DATASETS

We evaluated the algorithms on six datasets consisting of peripheral blood mononuclear cells (PBMCs) and corresponding ground-truth cell type fractions experimentally quantified using flow cytometry. Details of these datasets are given in Table 1. To deconvolve these datasets, we used PBMC8k (Table 2), a single-cell dataset from a healthy donor, as the reference for all methods considered here. To maintain the same genes between the single-cell data and the bulk RNA-seq, we subset both datasets over the common gene set. The numbers of common genes between PBMC8k and each of the bulk datasets are as follows: SDY67: 10717, Monaco bulk: 13122, Monaco microarray: 13467, GSE65133: 10717, GSE107572: , GSE120502: 13699. To deconvolve with the deep learning-based methods (Scaden, TAPE-O, TAPE-A, Linear MLPs and DISSECT), we use this single-cell data to create simulated bulk data with known fractions as described in Section 2. The non-deep learning methods (MuSiC and CibersortX) do not require simulations, and as such the single-cell data is used directly. For the estimation of cell type-specific gene expression per sample, we utilized simulations in the absence of corresponding ground truth in the real bulk RNA-seq. For this, in addition to PBMC8k, we considered three other reference datasets, namely PBMC6k, donorA and donorC (Table 2). The results are given in Section 4.3. Since the number of cell types is unknown a priori in a bulk RNA-seq dataset that we want to deconvolve, we create an "unknown" cell type label in the reference dataset by merging cells not belonging to any of the cell types present in the corresponding tissue (Menden et al., 2020; Chen et al., 2021). This unknown cluster allows comparison of fractions measured at a relative scale. PBMCs consist of five main cell types, namely CD4 T cells, CD8 T cells, NK cells, monocytes and B cells (Bittersohl & Steimer, 2016). We thereby end up with six cell types (including the unknown cell type).
Similarly, for the bulk RNA-seq datasets (Table 1), we grouped the ground truth cell type proportions not belonging to these five cell types into a single label "Unknown", following the methodology in Menden et al. (2020). To preprocess single-cell datasets, we utilized the procedure described in Appendix C, which includes quality control (QC) and simulations. Detailed information on the parameters used for simulations is provided in Appendix C.2. For the Monaco bulk dataset (Table 1), more granular fractions of cell type subsets are quantified using flow cytometry. We utilized this information to evaluate methods on discerning closely related or scarce cell type subsets. Since it is notoriously hard to identify cell types at such granularity in PBMC single-cell datasets from healthy donors, we considered 9852 RNA-seq samples of purified cells from Ota et al. (2021) as references. Purified RNA-seq samples are average profiles of thousands of cells of the same cell type. To match cell types between the purified reference and the flow cytometry from Monaco bulk, we harmonized cell type labels. The resulting ground truth and reference comprise 18 cell subsets, which are defined in Appendix D.

4.2. EVALUATION METRICS

We used Pearson correlation and root mean squared error (RMSE) for the evaluation of deconvolution results. Since some cell types are much more abundant than others, it is important to consider both the overall and the per cell type average correlation and RMSE (see Appendix B).
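The distinction between overall and per cell type metrics can be illustrated with a short sketch. The exact definitions used in the paper are given in Appendix B; the version below is an assumption made for illustration, in which the overall metrics pool all sample and cell type entries, and the per cell type metrics are computed column-wise and then averaged. The function names are hypothetical.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def evaluate(truth, pred):
    """truth, pred: arrays of shape (samples, cell types)."""
    n_types = truth.shape[1]
    # Overall: pool every (sample, cell type) entry.
    overall = {"r": pearson(truth.ravel(), pred.ravel()), "rmse": rmse(truth, pred)}
    # Per cell type: one value per column, then the average across columns.
    per_type_r = [pearson(truth[:, k], pred[:, k]) for k in range(n_types)]
    per_type_rmse = [rmse(truth[:, k], pred[:, k]) for k in range(n_types)]
    return overall, float(np.mean(per_type_r)), float(np.mean(per_type_rmse))
```

A pooled correlation can be dominated by abundant cell types, whereas the column-wise average weights every cell type equally, which is why both views are informative.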

4.3. RESULTS

In this section, we present results on experiments detailed in Section 4.1. Additional experiments and results on datasets without corresponding flow cytometry fractions are presented in Appendix E. There we utilize validated biological hypotheses to evaluate DISSECT against competing methods.

4.3.1. ESTIMATION OF CELL TYPE FRACTIONS

To evaluate deconvolution performance, we deconvolved each of the datasets in Table 1 using the PBMC8k reference dataset (Table 2). For MuSiC, we also evaluated using all 10x PBMC datasets listed in Table 2 as well as blood data from the Immune Cell Atlas (ICA) (Appendix Table 7), since MuSiC can take advantage of a multi-sample reference (Section 3). Tables 3 and 4 demonstrate that DISSECT shows significantly improved correlations in 9 out of 12 comparisons and the lowest RMSE in 11 out of 12 comparisons, across 6 different datasets. Next, we evaluated the cell fraction deconvolution performance on the Monaco bulk dataset (Section 4.1), which contains several closely related and rare cell types and constitutes a relatively harder task. With a correlation of 0.6, DISSECT's average performance is 14 percentage points better than the second best performance, achieved by Scaden (Appendix Table 8). DISSECT reaches the best correlation for 8 out of 18 cell types, while Scaden performs best for 3 out of 18 cell types. With an RMSE of 0.03, Scaden's performance is 1 percentage point better than DISSECT's (Appendix Table 9). In summary, DISSECT displays the best correlation and a highly competitive RMSE in the cell fraction deconvolution task. Finally, we performed an ablation study that validates our hypothesis that the consistency loss is primarily responsible for DISSECT's improved deconvolution performance (Appendix Table 18). We also evaluated the output of DISSECT at the end of the simulation-only phase. These results are provided in Appendix Tables 17 (correlation) and 18 (RMSE), where the simulation-only phase lags behind the full consistency-regularized training.

4.3.2. ESTIMATION OF CELL TYPE-SPECIFIC GENE EXPRESSION

Next, we evaluated the performance of DISSECT's cell type-specific gene expression inference on simulated bulk RNA-seq data. We used simulated data as we could not obtain bulk RNA-seq with corresponding cell type-specific expression information. To maintain a domain shift between the training and test datasets, simulated data for training and testing were created using different single-cell datasets. Here, we compare our approach with the only two state-of-the-art methods that can infer cell type-specific gene expression per sample, TAPE and CSx (Section 3). We simulated bulk samples from one of the four reference single-cell PBMC datasets listed in Table 2, and created training simulations from the remaining three. To assess the performances, we computed sample- (Table 13) and gene-wise (

5. DISCUSSION

We detailed how the use of a linear consistency is suitable for the task of deconvolution, especially in the absence of real ground truth training information, as is often the case in biomedical settings. Our approach relies on the supervised learning on simulated data and an unsupervised domain adaptation to the target data of interest. This semi-supervised learning approach results in state-of-the-art deconvolution performance, for both cell fraction and gene expression estimation. While we only focused on MLPs for estimation of cell type fractions and autoencoders for gene expression estimation in this work, we surmise that consistency regularization might improve other deconvolution algorithms as well. We envision further work in this area. The task of deconvolution plays an important role in spatial transcriptomics (ST) and cell-free DNA methylation (cfDNA). Recently, several algorithms have been developed for ST (Li et al., 2022) and cfDNA deconvolution (Jeong et al., 2022) . We surmise that consistency regularization might also improve ST and cfDNA deconvolution by adjusting DISSECT's simulation procedure to mimic ST or cfDNA. We provide two proof-of-concept ST deconvolution results using consistency regularization in Appendix F. Similar arguments can be made for the deconvolution of several other biomedical data types, such as epigenetic, proteomic, and metabolomic data, for instance.

6. LIMITATIONS

While DISSECT displays favorable deconvolution performance compared to other methods, the results are far from perfect. Especially for hard deconvolution tasks, such as samples with many similar cell types and very scarce cell populations, an increase in deconvolution performance is warranted. Future research into semi-supervised and contrastive algorithms as well as data augmentation and integration techniques should further enhance DISSECT's performance. As stated in Section 4.3.2 we had to rely on the simulation based experiment to evaluate gene expression estimations. Nevertheless, we still explored how well DISSECT can estimate gene expression using an ST dataset (Appendix H). Further evaluations with quality ground truths will be beneficial.

7. CODE AND DATA AVAILABILITY

All considered datasets are publicly available from their respective sources. Code is available anonymously at https://anonymous.4open.science/r/DISSECT-F0C4.

[Figure caption, truncated in the source: "... 1) and the estimations using Scaden."]

Figure 3: Correlations between the ground truth fractions from GSE120502 (Table 1) and the estimations using TAPE-O.

Before simulating from reference datasets, we remove cells with fewer than 200 expressed genes and genes that are expressed in fewer than 3 cells. Further, we also remove cells expressing more than 4% mitochondrial genes. Thereafter, before each deconvolution, we subset the reference and bulk datasets to include only the common genes between the two. This quality control step was identical for all methods.
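The quality control filters can be expressed as boolean masks over a count matrix. The sketch below is illustrative only: the function name, the order in which the filters are applied (cells first, then genes), and the interpretation of the 4% threshold as the fraction of a cell's counts that come from mitochondrial genes are assumptions.

```python
import numpy as np

def quality_control(counts, is_mito, min_genes=200, min_cells=3, max_mito_frac=0.04):
    """Filter a (cells x genes) count matrix.

    is_mito: boolean mask over genes marking mitochondrial genes.
    Returns the filtered matrix and the cell and gene masks that were applied.
    """
    genes_per_cell = (counts > 0).sum(axis=1)
    total = counts.sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    keep_cells = (genes_per_cell >= min_genes) & (mito_frac <= max_mito_frac)
    filtered = counts[keep_cells]
    # Genes must be expressed in at least min_cells of the remaining cells.
    keep_genes = (filtered > 0).sum(axis=0) >= min_cells
    return filtered[:, keep_genes], keep_cells, keep_genes
```

Returning the masks alongside the filtered matrix makes it straightforward to apply the same selection to cell type labels and gene names.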

C.2 SIMULATIONS

For deep learning methods, if the reference dataset is single-cell, we sampled $\alpha_{k,i}$ uniformly and simulated based on the procedure described in Section 2, with $\sum_{k=1}^{c} \alpha_{k,i} = 100$, $\forall i$. For experiments on granular cell types, where simulations are done from purified cell samples, we modify the simulation procedure to reflect this. In this case, a simulated sample is given by

$$B^{sim}_{i\bullet} = \sum_{k=1}^{c} X^{sim}_{ik} \, b^{k}_{l},$$

where $b^{k}_{l}$ is the expression vector of purified sample $l$ belonging to cell type $k$. All other notations are the same as in Section 2. For all experiments, we generated a total of 1000 × c simulations, where c is the number of cell types in the reference dataset.

C.3 PREPROCESSING:

Estimation of cell type fractions: Scaden, TAPE, Linear MLPs and DISSECT: Before passing simulated and real bulk samples to the network, we normalize samples to sum to a million counts (CPM: counts per million) and log-scale them with base 2 after adding 1. CPM normalization expresses each gene's count relative to a fixed total expression per sample and is widely used in computational genomics. During training, for each batch, we normalize each sample by MinMax scaling. These are standard preprocessing steps (Menden et al., 2020). For MuSiC and CibersortX (under S-mode), data was supplied on a linear scale as suggested in their respective publications, and no change was made to the default normalization methods of either (Wang et al., 2019; Newman et al., 2019).
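The preprocessing for fraction estimation amounts to two array transformations, sketched below with NumPy. The function names are hypothetical, and the small constant guarding against division by zero in the MinMax step is an assumption for numerical safety.

```python
import numpy as np

def cpm_log2(samples):
    """Scale each row to one million counts, then apply log2(x + 1)."""
    totals = samples.sum(axis=1, keepdims=True)
    cpm = samples / totals * 1e6
    return np.log2(cpm + 1.0)

def minmax_per_sample(samples):
    """Rescale each row (sample) independently to the range [0, 1]."""
    lo = samples.min(axis=1, keepdims=True)
    hi = samples.max(axis=1, keepdims=True)
    return (samples - lo) / np.maximum(hi - lo, 1e-12)
```

Both functions expect a (samples x genes) matrix and would be applied in sequence to simulated and real bulk samples alike.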

Estimation of per sample cell type specific gene expression profiles:

To estimate cell type-specific gene expression profiles, we need to maintain the relationship between the gene expression of individual cell types and the simulated bulks, which would be lost if we performed CPM normalization of both the simulated samples and the corresponding cell type-specific gene expression profiles. Hence, instead of performing CPM normalization of simulated bulks, we normalize each test bulk sample to sum to the mean of the sums of the simulated samples. Further, for estimating cell type-specific gene expression, we want to maintain gene-level information across samples. To achieve this, instead of normalizing each sample using MinMax scaling, we perform MinMax scaling globally over all samples. For TAPE, since the signature matrix is observed in the decoder (Section 3), the preprocessing is similar to the preprocessing done for estimating cell type fractions. For CibersortX, data was supplied on a linear scale under S-mode (Newman et al., 2019).
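A minimal sketch of this alternative preprocessing is given below. The function names are hypothetical, and "globally" is interpreted here as using a single minimum and maximum computed over the whole matrix, which is an assumption about the intended scaling.

```python
import numpy as np

def match_library_size(test_bulk, sim_bulk):
    """Scale each test sample to sum to the mean total of the simulated samples."""
    target = sim_bulk.sum(axis=1).mean()
    return test_bulk / test_bulk.sum(axis=1, keepdims=True) * target

def minmax_global(samples):
    """Rescale with one min and one max shared by all samples and genes."""
    lo, hi = samples.min(), samples.max()
    return (samples - lo) / max(hi - lo, 1e-12)
```

Unlike per-sample scaling, a shared minimum and maximum keeps the relative magnitude of a gene comparable across samples.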

E FURTHER EVALUATION OF DECONVOLUTION PERFORMANCE USING DIVERSE TISSUE DATASETS.

To assess the performance of DISSECT and other algorithms on further bulk RNA-seq datasets, we consider additional experiments. In Section E.1, we consider paired scRNA-seq and bulk RNA-seq data. In Section E.2, we look at the ability of the methods to recover established biological findings, and in Section E.3, we assess how the performance changes when the reference scRNA-seq dataset is swapped with another reference.

To evaluate deconvolution methods on further bulk RNA-seq datasets, we obtained paired scRNA-seq and bulk RNA-seq from two tissues: mammary gland and lung. The details of these datasets are provided in Table 11. The ground truth for bulk RNA-seq was generated using the fractions of cell types as observed in the scRNA-seq. Tables 12 and 13 present the results on these two tissues. For the mammary gland dataset, the results are calculated per sample and averaged, as done in the original publication of the data, since the dataset contains only two samples.

In this section, we rely on established biological findings to evaluate deconvolution methods. For this purpose we considered a diverse set of tissues: brain, kidney and pancreas. Table 14 lists these datasets and the corresponding hypotheses based on the literature. The single-cell datasets corresponding to these tissues are presented in Table 15. Here we are interested in investigating whether the deconvolution methods are faithful to the established biological findings (presented and discussed further in this section). We are also interested in how methods behave when different single-cell reference datasets are used (presented and discussed in Section E.3).

[Table 14 fragment: "... (Streit et al., 2009; Hindle, 2010; Fu et al., 2019), and 2. Approximately 70-30 ratio of excitatory and inhibitory neurons (Contreras, 2004; Chen & Dzakpasu, 2010). Bennett et al. (2018)"]
Deconvolution of ROSMAP with reference scRNA-seq from Allen Brain Atlas: The ROSMAP cohort consists of samples from healthy individuals and patients with Alzheimer's disease (AD). Here, we consider two biological ground truths: the first is neurodegeneration, or the loss of neurons with increasing Braak stages (Braak et al., 2003) (Table 14), and the second is the ratio of excitatory to inhibitory neurons. We deconvolved ROSMAP using the reference from the Allen Brain Atlas. The results are presented in Figure 8. Nearly all methods capture a negative association between the median fractions of excitatory neurons and Braak stages. Scaden, TAPE-O and DISSECT maintain a higher proportion of excitatory neurons than inhibitory neurons. However, the DISSECT estimates show an excitatory-inhibitory neuron ratio of almost 70-30. TAPE-A and MuSiC, on the other hand, show the opposite of what is expected.

Deconvolution of GSE50244 with reference scRNA-seq from Segerstolpe: GSE50244 is a bulk RNA-seq dataset from pancreas and consists of samples from healthy individuals and individuals with type 2 diabetes (T2D). Here our biological ground truth is the negative association between beta cell proportions and the level of hemoglobin A1c (HbA1c) (Table 14). We performed the deconvolution using Segerstolpe as the reference. We restricted the dataset to contain alpha, beta, gamma, delta, acinar and ductal cell types, following the methodology in Wang et al. (2019). The results are presented in Figure 4. We note that all deconvolution methods successfully reveal the significant association between beta cell proportions and HbA1c level; however, since the extent of the association is unknown, further quantification would be speculative.

Deconvolution of GSE81492 with reference scRNA-seq from Park: GSE81492 is a dataset consisting of APOL1 mutant mice, a chronic kidney disease (CKD) mouse model (Table 14). Here the biological ground truth is the decrease in tubule cells, namely proximal tubule (PT) cells and distal convoluted tubule (DCT) cells, in CKD samples compared to the healthy state. We deconvolved this dataset using the single-cell reference dataset Park. We present the results per method in Figure 5. All methods reveal a loss of PT cells in APOL1 mice, while showing a higher proportion of immune cells (B lymph, Fib, Macro and NK cells) in APOL1 mice. However, for DCT cells, which are also known to diminish in APOL1 mice, MuSiC shows an increase. This is also observed in Wang et al. (2019). The other methods were successful in revealing the loss of DCT cells in APOL1 mice.

E.3 RELATIONSHIP BETWEEN CELL TYPE FRACTIONS AND BIOLOGICAL PHENOTYPES USING MULTIPLE REFERENCE SCRNA-SEQ

Deconvolution of GSE50244 with reference scRNA-seq from Segerstolpe, Baron and Xin: For GSE50244, we performed deconvolution using three reference single-cell datasets. These references differ in the technologies with which the cells were sequenced: inDrop for Baron, Smart-seq2 for Segerstolpe, and SMARTer for Xin. Further, these reference datasets contain cells belonging to different states. We used cells from healthy individuals only in Baron and Segerstolpe, while cells from both healthy and T2D individuals are used in Xin. Further, since our goal is to compare the divergence in performance when the reference single-cell dataset is changed, we subsetted all three single-cell datasets to contain the same cell types. Figure 6 shows the distribution of alpha, beta, gamma and delta cells for each deconvolution method. Across all methods, the cell type distributions obtained with the three references lack concordance. Beta and alpha cells are generally the two most abundant of these four cell types in pancreatic islets (Henquin & Rahier, 2011). This is correctly observed with estimations from the considered deep learning-based methods. However, for Scaden, the relative proportions of alpha and beta cells are inverted between Baron and Xin. For DISSECT, they are predicted at almost the same level for the three datasets, with more varying beta cell fractions when Baron is the reference. TAPE-O also shows this trend; however, TAPE-O incorrectly predicts delta cells as being at the same level as alpha cells for the reference Baron. Linear MLPs show the most variance and predict almost 80% alpha cells for Segerstolpe and Xin. Further, with Linear MLPs, the least abundant gamma and delta cells are predicted to be negative for almost all samples. Next, we looked at the association between beta cell fractions and T2D severity (Table 14). Across all three datasets, DISSECT estimations are significantly negatively correlated with HbA1c. As in other experiments, we observe a wide discrepancy between TAPE-O and TAPE-A results.

Deconvolution of GSE81492 with reference scRNA-seq from Park and Miao: We deconvolved GSE81492 using two single-cell reference datasets, Park and Miao. Here, we subsetted these two reference datasets to contain the same cell types. We present the results per method per reference in Figure 9. To enable comparisons, the same y-axes were used for both single-cell datasets. Both of these datasets come from the same technology (10x Genomics). Despite this, almost all methods show variation in their estimates when changing the reference, with DISSECT showing the least variance and giving similar associations between cell type fractions and tissue state. These experiments show that while deconvolution methods in general follow the biology, the results differ across the single-cell references used. DISSECT, however, shows more robustness than other methods in this regard.
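The association test between estimated beta cell fractions and HbA1c is described in the figure captions as a multiple linear regression with age, sex and BMI as covariates. A minimal NumPy sketch of such a model is given below. It is an illustration under stated assumptions, not the analysis code behind Figures 4 and 7: the data are synthetic, the helper `ols` is hypothetical, and only coefficients and t-statistics are computed (a p-value would follow from a t-distribution with n minus 5 degrees of freedom).

```python
import numpy as np

def ols(y, covariates):
    """Fit y ~ Constant + covariates by least squares; return coefficients and t-statistics."""
    X = np.column_stack([np.ones(len(y))] + list(covariates))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

# Synthetic data standing in for per-sample estimates and clinical covariates.
rng = np.random.default_rng(0)
n = 77
beta_frac = rng.uniform(0.05, 0.6, n)      # estimated beta cell fractions
age = rng.uniform(30, 75, n)
sex = rng.integers(0, 2, n).astype(float)  # encoded as 0/1
bmi = rng.uniform(20, 35, n)
hba1c = 7.0 - 2.0 * beta_frac + 0.01 * age + 0.02 * bmi + rng.normal(0, 0.3, n)

# Model: HbA1c ~ Constant + fractions of beta cells + Age + Sex + BMI
coef, tstat = ols(hba1c, [beta_frac, age, sex, bmi])
beta_effect, beta_t = coef[1], tstat[1]
```

A negative `beta_effect` with a large negative `beta_t` corresponds to the expected negative association between beta cell fraction and HbA1c after adjusting for the covariates.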

F APPLICATION TO SPATIAL TRANSCRIPTOMICS

Here we illustrate the applicability of DISSECT to spatial transcriptomics (ST). We focused on two publicly available tissue slides (mouse brain and human lymph node) on which ST has been performed. In the brain, cortical neuronal layers are structured spatially. To verify whether DISSECT estimates may be valid in ST, we deconvolved a sagittal mouse brain slice available as part of Seurat (footnote 3). As reference, we used a mouse brain single-cell dataset from the Allen Brain Institute consisting of approximately 14000 cells sequenced using the Smart-seq v2 protocol (Tasic et al., 2016). We adjusted the simulation procedure to mimic ST datasets. 10x Visium (one of the technologies used to generate ST samples) captures around 10 cells per spot (footnotes 4 and 5). To reflect this, we simulated between 5 and 12 cells to generate one spot (i.e. [5, 12]). Since ST is much sparser, we kept between 2 and 6 cell types to generate one spot. Figure 10 shows the fractions of cell types overlaid on the hematoxylin and eosin (H&E) stained image of the tissue slide, and Figure 11 jointly shows the cortical neuronal proportions, which exhibit a spatially structured arrangement of neurons.

To evaluate how DISSECT behaves on ST deconvolution with granular-level subsets, we deconvolved a human lymph node slide using corresponding integrated single-cell datasets on which granular-level cell types are annotated. Both of these datasets are obtained from Kleshchevnikov et al. (2022). Remarkably, DISSECT is able to identify spatial patterns of cell type fractions, along with correctly predicting the co-localization of cycling B cells and germinal center B cells. This is illustrated in Figures 13 and 14. Several germinal center zones are correspondingly visible in the H&E stained image (Figure 13). These results demonstrate the usability of DISSECT for spatial deconvolution and warrant further evaluation of the consistency loss in data modalities other than bulk RNA-seq.
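The adjusted spot simulation described above (5 to 12 cells and 2 to 6 cell types per spot) could look roughly like the sketch below. The sampling details are assumptions made for illustration: the text does not state how cell types or cells are drawn, so uniform sampling is used here, and the function name and toy reference data are hypothetical.

```python
import numpy as np

def simulate_spot(expr, labels, rng, n_cells_range=(5, 12), n_types_range=(2, 6)):
    """Simulate one ST spot from a single-cell reference (cells x genes).

    Returns the summed expression of the sampled cells and the
    cell type fractions of the spot, ordered as np.unique(labels).
    """
    types = np.unique(labels)
    n_cells = rng.integers(n_cells_range[0], n_cells_range[1] + 1)
    max_types = min(n_types_range[1], len(types), n_cells)
    n_types = rng.integers(n_types_range[0], max_types + 1)
    chosen = rng.choice(types, size=n_types, replace=False)

    # One cell per chosen type guarantees that every chosen type is present.
    first = [rng.choice(np.flatnonzero(labels == t)) for t in chosen]
    # Remaining cells are drawn uniformly from all cells of the chosen types.
    pool = np.flatnonzero(np.isin(labels, chosen))
    extra = pool[rng.integers(0, pool.size, size=n_cells - n_types)]
    picked = np.concatenate([np.array(first, dtype=int), extra])

    spot_expr = expr[picked].sum(axis=0)
    fractions = np.array([(labels[picked] == t).mean() for t in types])
    return spot_expr, fractions

# Hypothetical reference: 8 cell types, 4 cells each, 5 genes.
rng = np.random.default_rng(0)
labels = np.repeat(np.array([f"type{i}" for i in range(8)]), 4)
expr = rng.poisson(2.0, size=(labels.size, 5)).astype(float)
spot_expr, fractions = simulate_spot(expr, labels, rng)
```

Each simulated spot then provides an expression vector together with known fractions, which is the supervision signal used for training on simulated data.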
This is due to the unavailability of bulk RNA-seq from paired tissue and cell populations. However, to investigate further how DISSECT performs in practice, we investigated the quality of gene expression estimation for brain ST using scRNA-seq data from the Allen Brain Atlas. Details of both datasets are provided in Section F. In this experiment, we were interested in answering two questions: 1. Do DISSECT estimates of gene expression reflect what is observed for that cell type in the literature? 2. Can DISSECT identify heterogeneity of the same cell type across samples (in this case spots) without having to pre-annotate cell type subsets? To answer the second question, we merged the excitatory neuronal subsets together and labeled them as "exc neurons". This allows us to test whether we observe heterogeneity in excitatory neurons after estimation. This resulted in 17 final cell types. We also filtered out cells where the proportion of the corresponding cell type is less than 1/(no. of cell types). This is reasonable as 10x Visium spots contain between 1 and 10 cells.
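The filtering rule above, a threshold of 1/(no. of cell types) on the estimated proportion, can be written as a simple mask. This is a sketch with a hypothetical function name and toy fractions; with the 17 cell types used here the threshold would be 1/17, roughly 0.059.

```python
import numpy as np

def keep_mask(fractions):
    """fractions: spots x cell types. True where the estimated proportion
    of a cell type in a spot is at least 1 / (no. of cell types)."""
    threshold = 1.0 / fractions.shape[1]
    return fractions >= threshold

# Hypothetical estimates for two spots and four cell types (threshold 0.25).
toy = np.array([[0.70, 0.20, 0.10, 0.00],
                [0.25, 0.25, 0.25, 0.25]])
mask = keep_mask(toy)

# Threshold for the 17 cell types considered in this experiment.
threshold_17 = 1.0 / 17
```

Entries where the mask is False would be excluded from the downstream analysis of the corresponding cell type's estimated expression.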

G ABLATION

The first question relates to the accuracy of the predicted signature, and the second question is about whether the biological reality of the sample at hand is preserved. To evaluate our results, we first computed PCA and UMAP embeddings of the predicted cell type-specific gene expression and identified disjoint cell type clusters (Figure 15). Second, we tested for differential expression (DE) of genes for each cell type. The top 5 DE genes for each cell type are visualized over the ST slide (Figure 16). To verify whether the DE genes make sense in the broader context of the literature, we performed gene set enrichment using Enrichr (footnote 6) with gene sets available from PanglaoDB (Franzén et al., 2019), a curated database of single cells from different tissues. The results are presented in Figures 17 and 18. The correct gene sets are enriched for each predicted signature (e.g. Astrocyte for the cluster "Astro"; Neurons or Interneurons for neuronal subsets such as Lamp5 and exc neurons; Microglia and Macrophage for "Macrophage"; Oligodendrocytes for the cluster "Oligo"). Next, we focused on the excitatory neurons. Figure 19 shows the expression of expected positive and negative marker genes (taken from the Allen Brain Atlas) over the ST slide. We observe that excitatory neurons do express the positive markers but not the negative ones. This supports a positive answer to the first of our aforementioned questions. To investigate whether we observe spatial heterogeneity in the predicted gene expression of excitatory neurons, we performed unsupervised Louvain clustering with the default resolution of 1. Figure 20 shows the Louvain clusters over UMAP and over the ST slide. We observe that the clusters have spatial variability and may be linked with location. To verify this further, we looked at the genes Cux2, Rorb and Fezf2, which are used in creating a neuronal taxonomy and in the in situ validation of excitatory cell types (Hodge et al., 2019; Tasic et al., 2016).
We observe the expression of these genes at the correct spatial locations (Cux2, Rorb and Fezf2, in this order, with increasing depth). Further, since these are only three genes taken from the literature, we wanted to examine what the genes differentially expressed among these 9 clusters indicate. To this end, we performed DE analysis and used the Allen Brain Atlas Up gene sets that are included in Enrichr (footnote 6). Here we could not use PanglaoDB as it does not provide detailed taxonomy gene sets of neurons. Figure 21 presents the results of the gene set enrichment. In total, six of the clusters resulted in DE gene sets with default settings (adjusted p-value cutoff of 0.01 and absolute log2 fold change cutoff of 1). We identify that each cluster is associated with certain brain regions. This supports a positive answer to our second question regarding the identification of heterogeneity within a cell type label.

In Figure 8, OPC denotes Oligodendrocyte Precursor Cells, Excitatory and Inhibitory are two neuron subsets, and the last two columns show the fractions of excitatory and inhibitory neurons out of the total neuronal content.
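The DE gene selection used before enrichment (adjusted p-value cutoff of 0.01 and absolute log2 fold change cutoff of 1) amounts to a two-condition filter, sketched below. The gene names and statistics are made up for illustration, the function name is hypothetical, and treating both cutoffs as strict inequalities is an assumption.

```python
import numpy as np

def select_de_genes(genes, padj, log2fc, padj_cutoff=0.01, lfc_cutoff=1.0):
    """Keep genes with adjusted p-value below the cutoff and
    absolute log2 fold change above the cutoff."""
    padj = np.asarray(padj)
    log2fc = np.asarray(log2fc)
    keep = (padj < padj_cutoff) & (np.abs(log2fc) > lfc_cutoff)
    return [g for g, k in zip(genes, keep) if k]

# Made-up DE statistics for four placeholder genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
padj = [0.001, 0.2, 0.005, 0.0001]
log2fc = [2.3, 1.8, -1.5, 0.4]

selected = select_de_genes(genes, padj, log2fc)
```

The resulting gene list per cluster is what would be passed on to a gene set enrichment tool.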



1 https://support.10xgenomics.com/single-cell-gene-expression/datasets
2 Calculated over real values.
3 https://satijalab.org/seurat/articles/spatial_vignette.html
4 Each spot is a location which is sequenced in the tissue slide. Thus, each spot is analogous to a bulk RNA-seq sample, albeit much sparser due to the lower number of cells per spot.
5 https://kb.10xgenomics.com/hc/en-us/articles/360035487952-How-many-cells-are-captured-in-a-single-spot-
6 https://maayanlab.cloud/Enrichr/



Figure 1: A. Illustration of simulation using reference single-cell data. The figure shows the simulation of one sample which consists of cell type fractions, simulated gene expression and cell type specific gene expression profiles (i.e. signature matrix). B. Detailed overview of an MLP used to estimate cell type fractions, and C. Overview of an autoencoder used to estimate cell type specific gene expression profiles.

B cells (4 subsets): B Naive (Naive B cells), B Ex (Exhausted B cells), B NSM (Non-switched memory B cells), B SM (Switched memory B cells); CD4 T cells (2 subsets): CD4 T Naive (Naive CD4 T cells), CD4 T Memory (Memory CD4 T cells); CD8 T cells (4 subsets): T CD8 Naive (Naive CD8 T cells), CD8 T CM (Central Memory CD8 T cells), CD8 T TE (Terminal effector CD8 T cells), and CD8 T EM (Effector Memory CD8 T cells); Monocytes (3 subsets): Monocytes C (Classical monocytes), Monocyte NC (Non-classical monocytes) and Monocytes I (Intermediate monocytes); Dendritic cells (2 subsets): mDC (myeloid dendritic cells), pDCs (Plasmacytoid dendritic cells); Plasmablasts; Neutrophils LD (Low density neutrophils); NK cells.

Details on single-cell datasets used to deconvolve corresponding tissue samples.

Figure 4: A: Distribution of relative fractions of alpha, beta, gamma, delta, acinar and ductal cells estimated on 77 bulk RNA-seq samples from GSE50244 using the Segerstolpe (Seger) single-cell reference dataset. B: Relationship between hemoglobin A1c (HbA1c) levels and estimated fractions of beta cells in 77 bulk RNA-seq samples from GSE50244 using Seger. The corresponding method for each plot is indicated in the title. The Pearson correlation of the relationship and the associated p-value are indicated in the plot. The p-value shown is obtained for beta cells from a multiple linear regression model considering age, sex and body mass index (BMI) as covariates, i.e. the model HbA1c ∼ Constant + fractions of beta cells + Age + Sex + BMI.

Figure 5: Estimated fractions of different cell types in 10 bulk RNA-seq samples from GSE81492 (Ctrl: Control mice, n=6 and APOL1: Apolipoprotein L1 transgenic mice, n=4) using the single-cell reference dataset Park. Corresponding methods are indicated in the title. Each row corresponds to a cell type. DCT: Distal convoluted tubule, Endo: Endothelial cells, LOH: Loop of Henle, Macro: Macrophages, Neutro: Neutrophils, PT: Proximal Tubule, Podo: Podocytes, CD-PC: collecting duct principal cell; CD-IC: collecting duct intercalated cell.

Figure 6: Distribution of relative fractions of alpha, beta, gamma and delta cells (present in all three single-cell datasets) estimated on 77 bulk RNA-seq samples from GSE50244 using three different single-cell reference datasets. Each row corresponds to a single-cell reference dataset.

Figure 7: Relationship between hemoglobin A1c (HbA1c) levels and estimated fractions of beta cells in 77 bulk RNA-seq samples from GSE50244 using three different single-cell reference datasets. The corresponding method for each plot is indicated in the title. Each row corresponds to a single-cell reference dataset. The Pearson correlation of the relationship and the associated p-value are indicated in the plot. The p-value shown is obtained for beta cells from a multiple linear regression model considering age, sex and body mass index (BMI) as covariates, i.e. the model HbA1c ∼ Constant + fractions of beta cells + Age + Sex + BMI.

Figure 8: Fractions of cell types in 463 bulk RNA-seq samples from the ROSMAP Alzheimer's Disease cohort for which the corresponding Braak stages are available. Allen Brain Atlas is used as reference single-cell data (Table 15). Rows indicate methods and each column is a cell type. OPC: Oligodendrocyte Precursor Cells.

Figure 9: Estimated fractions of different cell types in 10 bulk RNA-seq samples from GSE81492 (Ctrl: Control mice, n=6 and APOL1: Apolipoprotein L1

Figure 12: Estimated fractions of 34 granular level cell types overlaid on H&E image of a lymph node tissue slice.

Figure 13: H&E image of lymph node tissue slice.

Figure 14: Fractions of Cycling and light zone (LZ) and dark zone (DZ) Germinal center B cells expected to be present in germinal centers.

Figure 15: A: PCA and UMAP embeddings of estimated gene expression profiles. B: PCA and UMAP embeddings computed on neuronal clusters. C: Clustered matrix showing Pearson correlation between each pair of cell types.

Figure 16: Scaled expression of top five DE genes for cell types shown over the H&E tissue slide. Rows indicate cell types.

Figure 17: Plots showing gene set enrichment results of each cell type. For each cell type, the DE genes were selected with an adjusted p-value cutoff of 0.01 and an absolute log2 fold change cutoff of 1. PanglaoDB Augmented 2021 gene sets were used as background.

Figure 19: Evaluation of known positive and negative marker genes for excitatory neurons. The negative marker genes are genes that are highly expressed in other cell types compared to excitatory neurons. Sox10, Olig1: Oligodendrocytes, Sst: Sst neurons, Gfap: Astrocytes, Ctss: Microglia, Gad1, Gad2: Inhibitory neurons, Cldn5: Endothelial cells.

Figure 20: Top: (From left to right) Louvain clustering on estimated gene expression of excitatory neurons; log2 gene expression of Cux2, Rorb and Fezf2. Bottom left: Louvain clusters identified on estimated excitatory neurons visualized on top of the H&E slide. Bottom right: Expression of Cux2, Rorb and Fezf2 in excitatory neurons jointly visualized over the H&E slide.

Figure 21: Gene set enrichment of DE genes computed for each Louvain cluster on excitatory neurons. Enrichr was used to perform gene set enrichment of each DE gene set with Allen Brain Atlas Up gene sets as background. The x-axis indicates the -log10 adjusted p-value. A higher value indicates greater significance. The y-axis lists the enriched gene sets ordered by decreasing significance.

Real bulk datasets with known ground truth cell fraction information for the evaluation of deconvolution performance.

Single-cell data used for the creation of simulated reference data for supervised training. We manually annotated cells with these five cell types based on expression of cell type marker genes. QC refers to Quality Control step (Appendix C.3) performed to filter out poor quality cells and genes that are expressed in too few cells.

Average (overall)  Pearson correlation between estimations and flow cytometry cell type fractions computed over all cell types.

Average (overall)  RMSE between estimations and flow cytometry cell type fractions computed over all cell types.



Pearson correlation between ground truth and estimated gene expression profiles on simulated datasets averaged over samples. The column Dataset indicates the single-cell dataset used to create simulations for the test set.

Pearson correlation between ground truth and estimated gene expression profiles on simulated datasets averaged over estimated genes.

Details on blood single-cell data from the Immune Cell Atlas (Domínguez Conde et al., 2022) used as a multi-sample reference scRNA-seq for MuSiC. Cell types that were present in fewer than 5 samples were dropped. The same cell types were selected as in the scRNA-seq datasets listed in Table 2: Bcells, CD4Tcells, CD8Tcells, Monocytes and NK. The rest of the cell types were merged to form the unknown cluster (Section 4.1).

Pearson correlation between estimates from different methods and flow cytometry for granular cell type fractions in Monaco bulk.

RMSE between estimates from different methods and flow cytometry for granular cell type fractions in Monaco bulk.

Harmonization of cell subset labels between Monaco bulk and reference from Ota et al.

Details on paired single-cell and bulk RNA-seq datasets considered. Cell type labels were used as provided in the corresponding sources.

Average Pearson correlation and RMSE between ground truth and estimated cell type fractions on the Lung dataset over all cell types.

Average Pearson correlation and RMSE between ground truth and estimated cell type fractions on the Mammary gland dataset over samples.

Details on bulk RNA-seq datasets used to evaluate deconvolution methods on biological phenotypes. Biological hypotheses based on literature serve as proxy ground truths.

Average performance over five random experiments for SDY67 (Table 1). Each column indicates the additional part.

Average Pearson correlation between estimations and flow cytometry cell type fractions after the simulation-only phase (Step 2000, λ = 0) in comparison to the full training.

Average RMSE between estimations and flow cytometry cell type fractions after the simulation-only phase (Step 2000, λ = 0) in comparison to the full training.

