In single-cell RNA-seq (scRNA-seq) experiments, the number of individual cells has increased exponentially, and the sequencing depth of each cell has decreased significantly. As a result, analyzing scRNA-seq data requires extensive considerations of program efficiency and method selection. In order to reduce the complexity of scRNA-seq data analysis, we present scedar, a scalable Python package for scRNA-seq exploratory data analysis. The package provides a convenient and reliable interface for performing visualization, imputation of gene dropouts, detection of rare transcriptomic profiles, and clustering on large-scale scRNA-seq datasets. The analytical methods are efficient, and they also do not assume that the data follow certain statistical distributions. The package is extensible and modular, which would facilitate the further development of functionalities for future requirements with the open-source development community. The scedar package is distributed under the terms of the MIT license at https://pypi.org/project/scedar.
Cost-effective large-scale transcriptomic profiling of individual cells is enabled by the development of microfluidic, nanodroplet, and massively parallel sequencing technologies. Using these technologies, single-cell RNA-seq (scRNA-seq) experiments usually generate transcriptomic profiles of thousands to millions of individual cells . Therefore, scRNA-seq has become more commonly used to either study specific biological questions or comprehensively profile certain tissues or organisms [2–4].
Analyses of scRNA-seq datasets therefore require efficient computational programs and sophisticated statistical methods. The programs should be able to manage memory efficiently, exploit multiple cores of the processing units, and handle errors and exceptions gracefully. The statistical methods must be able to function against high dimensionality, low signal-to-noise ratio, and different characteristics of data generated from different technologies and protocols [5–7]. Such requirements can become a barrier between experimental design and biological interpretations of the results.
In order to be scalable, methods have been designed to minimize the usage of hardware resources, so that a large-scale scRNA-seq dataset can be analyzed using a desktop computer, such as Seurat v3.0  and Scanpy . Seurat is an R package providing visualization and robust statistical methods to explore and interpret the heterogeneity of the dataset. Scanpy is a Python package providing efficient reimplementations of pre-existing statistical methods and analytical workflows, which can be used to perform general exploratory data analysis and special inferences.
Here, we attempt to achieve the scalability of scRNA-seq data analysis through an alternative approach, which is to enable the user to exploit powerful analytical methods using modern high-performance computing architectures [10–12], such as servers and clusters with large amount of memory, multiple central and graphical processing unit cores, and solid-state drives (SSD) with higher read and write speed than traditional hard disk drive (HDD). Such computational resources have been made easily accessible by cloud computing services like Amazon Web Services, Google Cloud, and Microsoft Azure. Applying this approach, we designed the analytical methods to be able to run in parallel processes with minimized memory overhead.
Additionally, we also aim to develop robust exploratory data analysis (EDA) methods, so that they could be applied to various datasets generated by different experimental designs, technologies, and protocols. Single-cell RNA-seq datasets have distinct statistical characteristics if generated from different scRNA-seq technologies and platforms [6,7,13], such as SMARTer , Drop-seq  and 10x Genomics . In order to avoid assumptions on the statistical distributions of the datasets, we incorporated efficient implementations of machine learning methods into the data exploration process. Through extensive exploration, the statistical properties of the data could be observed and used to guide the selection of appropriate methods for biological interpretation .
Therefore, we developed a scalable and reliable Python package, single-cell exploratory data analysis for R NA-seq (scedar), to facilitate the exploration of large-scale scRNA-seq datasets. Scedar provides analytical routines for visualization, gene dropout imputation, rare transcriptomic profile detection, clustering, and identification of cluster separating genes. The visualization methods are integrated with the efficient scRNA-seq data structures to provide intuitive, convenient, and flexible plotting interfaces. We implemented methods to impute gene dropouts (Algorithm A in S1 Text) and detect rare transcriptomic profiles (Algorithm B in S1 Text) based on the k-nearest neighbor (KNN) algorithm. The detected rare transcriptomic profiles could be compared with their nearest neighbors in detail to identify rare cell states or types. For clustering analysis, we provide a novel cell clustering algorithm: Minimum Description Length (MDL) Iteratively Regularized Agglomerative C lustering (MIRAC), shown in Algorithm 1. To identify genes that distinguish cell clusters we provide a method utilizing XGBoost, a sparsity-aware gradient boosted tree system .
We designed scedar in an object-oriented manner for quickly exploring large-scale scRNA-seq transcription level matrices on a remote server utilizing parallel computing techniques, in order to provide a robust and extensible platform for EDA, rather than surpassing the analytical performance of any of ≥ 275 existing scRNA-seq data analysis methods . The application programming interface (API) is designed to be intuitive to users familiar with the R programming language. The standard analysis workflow is to explore the dataset, cluster cells, and identify cluster separating genes (Fig 1), which is implemented as four main modules: data structure, KNN, clustering and visualization.
The core data structure stores each transcription level matrix as a standalone sparse matrix or full array instance, and it can easily be extended to support customized analytical procedures. The built-in extensions include common procedures like pairwise distance computation, Principal Component Analysis (PCA), t-SNE , UMAP , and k-nearest neighbor graph . We optimized time and memory efficiency with the following design patterns: parallel computing, lazy loading, caching and copy-on-write.
The KNN and clustering modules utilize the data structure and parallel computing to efficiently perform analytical procedures, and the results are stored in the data structure for further reference.
The visualization module contains plotting methods optimized for large datasets, especially for plotting heatmaps. For example, it takes less than two minutes to generate a heatmap image from a 50,000 x 20,000 matrix containing random standard normal entries.
Preprocessing is implemented as selection and transformation routines of the core data structure, which is not a focus of scedar. The package is designed to identify the necessity of certain preprocessing procedures by extensively exploring the original state of the data. Although preprocessing, such as batch effect correction and normalization, could facilitate the detection of biological variances between cells, applying preprocessing methods correctly requires careful validation of their assumptions, otherwise they may introduce unwanted bias or variability .
Minimum description length (MDL) iteratively regulated agglomerative clustering (MIRAC) extends hierarchical agglomerative clustering (HAC)  in a divide and conquer manner for scRNA-seq data. Input with raw or dimensionality reduced scRNA-seq data, MIRAC starts with one round of bottom-up HAC to build a tree with optimal linear leaf ordering , and the tree is then divided into small sub-clusters, which are further merged iteratively into clusters. Because each individual cluster becomes more homogenous with higher number of clusters, the iterative merging process is regularized with the MDL principle . The asymptotic time complexity of the MIRAC algorithm is O(n4+mn2) where n is the number of samples, and m is the number of features. The space complexity is O(n2+mn ). Relevant mathematical theories and notations of MDL are briefly described in the following section. The pseudo-code of MIRAC is shown in Fig 2.
Comparing to HAC, MIRAC is not designed to be faster but rather to improve the cluster robustness. The asymptotic time complexity of MIRAC is the same as HAC with optimal leaf ordering. However, the use of MDL rather than deterministic similarity metrics, improves the noise tolerance by estimating similarity with probabilistic models to give more weight on signal and less weight on noise, assuming that the signal is stronger than noise.
The rationale behind MIRAC is to reduce the number of cell partitions that need to be evaluated and find a good one among them.
The number of all possible partitions of a set with n elements is the Bell number, which could be computed with the following recurrencewhere n≥1 and B0 = 1 . It is computationally intractable to compute the code lengths of Bn partitions, so we reduced the number of cell partitions to evaluate with the following steps:
In order to find a good cell partition Pg in the subset of all possibles, we iteratively inspect each cluster of Ps and merge it with either its left or right adjacent cluster according to the similarity determined by the two-stage MDL scheme described in the following sections. The procedure has the following steps:
The code length of divided is
If , merge and , otherwise split them. This conditional statement generalizes the situations where Lm is either non-negative or negative.
We use a standard HAC tree to perform MIRAC when the number of samples is larger than ten thousand, which usually results in decreased but still acceptable performances (Fig 3). Although the optimal leaf ordering of the HAC tree is an important assumption, its computation takes too long when the number of samples is large. For example, it takes more than a week to compute the optimal leaf ordering of a dataset with about 68,000 samples. However, the time complexity of computing a good HAC linear ordering could be significantly reduced by implementing other ordering techniques .
The time complexity of MIRAC is (n4+mn2), where n,m≥1. The time complexity of computing the HAC tree with optimal leaf ordering is O(n4 ) . The time complexity of finding Pg is O(mn2), which is briefly analyzed as following.
The time complexity is O(nm) for computing the code length L(Xn×m,P). In computing L(Xn×m,P) we compute the code lengths of each cluster and their labels. Since it takes O(n) time to compute the code length of n observations with a single feature, computing the code length of all clusters and features individually takes O(nm) time, and computing the code length of n labels take O(n) time.
Similarly, the time complexity is O(n1m+n2m) for computing the code length of encoded by a model fitted with . Fitting a model on takes O(n2m) amount of time. Using the model to encode takes O(n1m) amount of time.
The time complexity of searching for Pg is O(mn2). In the worst-case scenario, |Ps| = n, and the inspecting cluster is always merged with the left-hand side cluster in step 4. However, the impact of and step 3 on the time complexity is not straightforward, so we divide the step 3 code length computation into three parts and analyze the upper bound of them individually. The divided three parts are the left cluster, inspecting cluster, and right minimax cluster. Because we keep increasing the size of the left cluster, the upper bound of computing the left cluster code length throughout the execution is O(mn2). The maximum size of the inspecting cluster is , so the upper bound of computing the inspecting cluster code length is also O(mn2), when every middle singleton is evaluated times before merging with the left. The maximum size of the right minimax cluster is , so the upper bound of evaluating the right minimax cluster is also O(mn2), when every increment of the left cluster takes times of evaluating the right minimax cluster. Summarizing these three parts, the overall upper bound is therefore O(mn2).
The space complexity of MIRAC is O(nm+n2), which could be decomposed into the following three parts. The space complexity of storing the n×m data matrix X is O(mn). The space complexity of storing the HAC tree with optimal leaf ordering is O(n). The space complexity of storing the pairwise distance matrix is O(n2).
We extended MIRAC with community detection to improve scalability. We apply MIRAC on a relatively large number of detected single-cell communities to identify final clusters. We used the Leiden algorithm for community detection on KNN graphs . We also provide a KNN graph construction method that supports approximate nearest neighbor (ANN) search using a Hierarchical Navigable Small World (HNSW) graph .
We use XGBoost , a scalable and sparsity-aware boosted tree system, to identify genes that are able to separate a specific set of clusters. This method is designed for data exploration after applying any one of a number of various statistical approaches that have been developed to identify differentially expressed genes [31–35], in order to quickly identify genes or sets of genes that are able to separate an arbitrary set of clusters for further inspection.
Rather than providing a meticulous p-value for each gene among the compared clusters, we rank the genes by their individual importance on separating the clusters under comparison. The importance of a gene in the trained classifier is the number of times it has been used as an inner node of the decision trees. We use cross validation to train an XGBoost classifier on the compared clusters, and the classifier is essentially a bag of decision trees . In order to alleviate the influences of stochasticity on interpretation, we use bootstrap with feature shuffling to better estimate the importance of genes in separating the compared clusters. The obtained list of important genes could be further explored by inspecting transcription level fold changes and decision tree structures.
Comparing to NSForest , a method based on random forest [37,38] to identify a parsimonious set of cluster separating genes from scRNA-seq data, our method identifies all possible cluster separating genes using gradient boosting. Practically, NSForest version 1.3 is distributed on GitHub as a Python script without encapsulation or testing (checked on Oct 11, 2018), but our method is distributed through the Python Package Index, comprehensively tested, and easy to use through the user-friendly API. With regard to scalability, our method uses a scalable implementation of gradient boosting algorithm , whereas NSForest uses the implementation of random forest in scikit-learn .
A k-nearest neighbor analytical strategy exploits the similarity between data points for classification, regression, and imputation [39,40]. In k-nearest neighbor methods, samples are considered as points in a space with each dimension representing a measured property, which is often referred to as a feature, of the samples. The similarity between samples can be evaluated with various distance metrics, which is extensively reviewed by Bellet et al . . Generally, a distance metric is a function that takes two samples and output a numeric value to represent the distance between the two samples in their feature space. For example, the Euclidean distance metric is the following function,where the p and q are two samples, and qi and pi are the data values on their ith dimension. The k-nearest neighbors of a sample are the k number of other samples that have the smallest distances from the sample. The k-nearest neighbors of a sample are informative, due to their similarity, to determine the category of the sample in classification, the relevant continuous property in regression, and the missing values in imputation.
We developed two methods based on the KNN algorithm to facilitate the exploration of scRNA-seq datasets. For a relatively large number of cells profiled in each scRNA-seq experiment, we assume that each one of the non-rare cells is similar to at least k other cells in their transcriptomic profiles. With this assumption, we impute gene dropouts and detect rare transcriptomic profiles. In the scedar implementation, we also support approximate nearest neighbor (ANN) search using Hierarchical Navigable Small World (HNSW) graph , which greatly improves the scalability of KNN graph construction.
In an scRNA-seq experiment, if a truly expressed gene is not detected in a cell, the gene is considered a “dropout”, and such events are called gene dropouts . Gene dropouts may be caused by biological and technical reasons [32,42,43]. The rationale behind the possible causes of biological dropouts are related to phenomena such as transcriptional bursting [44,45] and RNA degradation. With regard to technical dropouts, the main concerns are the relatively small number of RNA transcripts of a gene, amplification efficiency, and batch effects .
We exploit the transcriptomic profiles of the k-nearest neighbors of a cell to impute the gene dropouts in the cell (Algorithm A in S1 Text). The algorithm could take multiple iterations, so that the dropped-out genes that are expressed in all k-nearest neighbors could be imputed at first, and the ones that are expressed in most but not all of k-nearest neighbors could be imputed in the following iterations.
We mark transcriptomic profiles as rare if they are distinct from their k-nearest neighbors, according to the pairwise similarity between cells (Algorithm B in S1 Text). The algorithm could take multiple iterations, so that the most distinct transcriptomic profiles could be marked at first and less distinct ones in the following iterations.
This method is provided mainly to facilitate detailed inspection of rare transcriptomic profiles rather than removing outliers from the data. Because rare transcriptomic profiles may have various biological and technical causes, samples and features in a dataset should only be removed after extensive exploratory data analysis and rigorous reasoning with domain specific knowledge. Closely comparing rare transcriptomic profiles with their nearest neighbors may also yield insights into their biological differences, which may further facilitate the identification of rare cell types and states.
We benchmarked the clustering and KNN performances of scedar on simulated and experimental scRNA-seq datasets. We obtained previously published experimental scRNA-seq datasets (Table 1). We also generated 50 simulated scRNA-seq read count datasets with Splatter . The simulation parameters are estimated by Splatter according to a Drop-seq dataset  and a 10x Genomics GemCode dataset . Within each simulated dataset, the cells have 8 clusters taking approximately the following proportions: ⟨0.3, 0.2, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05⟩, with a gene dropout rate around 5%.
|Publication||# cells||# genes||Organism||Tissue||Protocol||Raw Data Type|
|Deng et al. (2014)||268||22,431||Mus Musculus||Embryo||Smart-Seq/Smart-Seq2||Read count|
|Pollen et al. (2014)||301||23,730||Homo sapiens||Dermal, blood, pluripotent and neural||SMARTer||TPM|
|Kolodziejczyk et al. (2015)||704||38,653||Mus Musculus||Embryonic stem cell||SMARTer||Read count|
|Zeisel et al. (2015)||3005||19,972||Mus Musculus||Brain||STRT-Seq UMI||Read count|
|Macosko et al. (2015)||44,808||23,288||Mus Musculus||Retina||Drop-seq||Read count|
|Zheng et al. (2015)||68,579||33,694||Homo sapiens||Peripheral blood mononuclear cell||GemCode||Read count|
|Han et al. (2018)||405,191||39,855||Mus Musculus||Various tissues||Microwell-Seq||Read count|
|Cao et al. (2019)||2,058,652||26,183||Mus Musculus||Embryo||sci-RNA-seq3||Read count|
We performed all benchmark analyses on a high-performance computing cluster, of which the computing resources are strictly managed by Univa Grid Engine. The cluster nodes have CPUs of Intel Xeon E5-2680 v3 or Intel Xeon E7-8880 v3. The memory sizes are either 128GB, 256GB, or 1TB. When scheduling analytical jobs for benchmarking, we make sure that the number of cores and allocated memory are enough for the program.
The clustering accuracy and stability of MIRAC were benchmarked together with several other clustering methods on experimental scRNA-seq datasets listed in Table 1.
The following clustering methods are directly applied on the original data without preprocessing.
Although MIRAC could be directly applied on the expression matrix, dimensionality reduction is able to improve the performance of similarity and density-based clustering methods when the number of features is high . The mathematical influences of the high number of features are briefly described in the previous sections.
Although t-SNE projections are stochastic and influenced by the perplexity parameter, t-SNE is proved to be able to recover well-separated clusters  and t-SNE has been extensively used as dimensionality reduction method for scRNA-seq data [3,15]. Thus, we chose to use t-SNE for demonstration in this report, but we note that scedar also supports the use of PCA and UMAP for MIRAC clustering, which can be applied just as easily as the t-SNE method.
When benchmarking for accuracy, we cluster the cells in each experimental dataset using the compared clustering methods with a grid of parameters. We use the maximum similarities between the clustering results and the cell types from the publications in order to compare the accuracy of different clustering methods. Although taking the maximum increases the chance of overfitting, it resembles the procedure of clustering analysis in practice.
We also recorded the runtime of clustering methods on different datasets. In Fig 3A, SC3 was performed with 20 cores, and community clustering is performed with 60 cores. Other clustering methods were performed with a single core. When running MIRAC, we did not require the hierarchical tree to be optimal. The community-detection-extended MIRAC clustering method was performed with 25 CPU cores, and other methods were performed with 60 CPU cores (S8 Fig, panel C).
We also benchmarked the influences of t-SNE random states on the clustering results (S1 Fig). The t-SNE embeddings generated with different random states may be distinct from each other, although current mathematical results have shown that clustering using t-SNE embeddings is able to recover well separated clusters . We characterized the stability to random states of clustering methods by running them on each experimental dataset with the same parameters but ten different random states. The similarity of clustering results between different random states are used to compare the stability of different clustering methods. The stability of community clustering is not characterized, because it is based on PCA.
We use two cluster similarity metrics, cluster consistent ratio (CCR) and adjusted Rand index (ARI)  for measuring the accuracy and stability of clustering methods respectively. When we have a coarse reference partition Pr and a finer clustering partition Pc, the CCR is computed as the ratio of pairs within each cluster of Pc that are also in the same cluster of Pr , with the number of clusters kept the same across compared methods. The ARI is computed with the Python package scikit-learn  using the mathematical formula given by Hubert and Arabie .
The reference partitions Pr of real datasets are obtained from their original publications (Table A in S1 Text). The clusters in Deng et al . dataset  are experimentally determined by manual single cell isolation. The clusters in Pollen et al .  are experimentally determined by the cell line or tissue type of the isolated single cells. The clusters in Kolodziejczyk et al .  are determined by the embryonic stem cell culturing conditions and batches. The clusters in Zeisel et al .  are determined by the BackSPIN clustering method  and further inspected with domain specific knowledge.
We choose CCR to measure clustering accuracy rather than ARI, because ARI greatly penalizes the split of a large cluster in Pr into multiple smaller ones in Pc. This behavior of ARI could prevent the evaluation of transcriptomic variabilities within a large group of cells in clustering analysis, which is an important goal of scRNA-seq experiments that cannot be easily achieved by bulk RNA-seq experiments. In addition, we also provide a method in scedar to easily merge multiple clusters together, in case the users found the sub-types of a cell type are very similar to each other.
However, we used ARI to measure clustering stability rather than CCR, because the differences between Pr and Pc are completely caused by different random states, hence splitting a cluster in Pr should be penalized.
MDL is not used as a cluster similarity metric, even though MDL is used in MIRAC to guide the process of finding good partitions. MDL is only used to guide and regularize the merging process of local adjacent sub-clusters by evaluating their similarity, but it is not used as an objective to be optimized globally. Using MDL as a benchmarking metric would introduce a bias as among the various methods, MIRAC is the only method using MDL. Moreover, interpreting relative global MDL differences is not straightforward as it is not only affected by cluster labels, but also transcription levels.
We visualize real datasets before and after removing rare transcriptomic profiles in t-SNE projection and pairwise distance heatmaps, without quantitative evaluations like receiver operating characteristic (ROC) curve, since rare transcriptomic profiles are not well defined with a large number of genes . Approaches to identify rare data points in low-dimensional spaces (≤3) do not scale well to higher dimensions due to the exponentially decreased density of data points in space and increased instability of distances, which is elaborated in the previous section about mathematical theories on high-dimensional data analysis. KNN rare transcriptomic profile detection was performed with 25 CPU cores, as shown in S8 Fig panel B.
It is important to note that rare transcriptomic profiles are detected to facilitate detailed inspection rather than the removal of them from the data. The visualizations are only used to illustrate the capability of the KNN method for detecting rare transcriptomic profiles.
We simulate gene dropouts using Splatter  to obtain a dropout rate around 5%. Then, we benchmark the performance of imputing gene dropout as two parts: detection and smoothing. On simulated data, because the true gene dropouts are known, we use a ROC curve and mean squared errors (MSEs) to characterize the performance of gene dropout detection and smoothing respectively. In Fig 4A, S7 Fig panel A, and S8 Fig panel A, all imputation methods were executed with a single core. On real data, we visualize the cells in 2D t-SNE space before and after imputation. The compared methods are KNN gene dropout imputation (KNNGDI) version 0.1.5, SAVER version 1.0.0 , MAGIC version 1.1.0 , and scImpute version 0.0.6 .
We characterized the analysis speed-up by parallel processing on different datasets (S9 Fig). KNN dropout imputation is not evaluated on larger datasets, because the speed limiting step has not yet been parallelized. In addition to parallelizing the computation in the implemented methods, we also provide an easy-to-use utility function (scedar.utils.parmap) to parallelly run analytical methods with different parameters, which would significantly facilitate exploratory parameter search.
We illustrate the basic workflow of using scedar for scRNA-seq exploratory data analysis with the dataset published by Zeisel et al .  (Fig 1). The dataset contains the RNA unique molecule identifier (UMI) counts of 19,972 genes in 3005 cells from the mouse brain. We selected this dataset for demonstration because it could be clearly visualized in small graphs, although our package is capable of analyzing much larger datasets with over 2 million cells (S2 Fig, S3 Fig, S4 Fig and S5 Fig).
In Fig 1, each box represents a data analysis step, and they are consecutively executed according to the arrow. The purpose and runtime are listed in the upper ribbon. The code that is essential to the step is listed in the box, and their results are also shown.
The preparation step imports required packages and loads the data. The class SampleDistanceMatrix is one of the core data structures in the package that is used to store the read counts and pairwise distances of an scRNA-seq dataset. Because the pairwise distance computation is delayed until necessary, i.e. lazily loaded, this step only takes 12 seconds. We use cosine distance rather than correlation distance to greatly speed up the computation, since we implemented the computation procedure of pairwise cosine distances completely with NumPy linear algebra operations with OpenBLAS backend [57,58].
The t-Distributed Stochastic Neighbor Embedding (t-SNE) scatter plot and KNN graph are used to explore the dataset. The cell type labels published by Zeisel et al .  are truncated to fit in the space. We also provide methods to visualize arbitrary statistics, e.g. number of expressed genes, of individual cells as color gradient. The layouts of cells in t-SNE and KNN graph are similar to each other. Although KNN graph is faster than t-SNE, the runtime for t-SNE could be reduced by optimizing its parameters, e.g. lowering the number of iterations.
The MIRAC step clusters the cells and visualizes them with t-SNE scatter plot and pairwise distance matrix heatmap. The heatmap generation procedure in scedar is optimized for large-scale datasets, which is able to generate a heatmap with tens of thousands of columns and rows in a few minutes. Users could also generate heatmaps for the read count matrix to directly inspect the sparsity of datasets (all panels B in S2 Fig, S3 Fig, S4 Fig and S5 Fig).
The last step identifies cluster separating genes with XGBoost . Users could choose an arbitrary set of clusters to compare, and the genes are ranked by their importance in separating the clusters. Then, the read counts of a gene across clustered labels could easily be visualized by t-SNE scatter plot.
We benchmarked several clustering methods on the datasets listed in Table 1 (Fig 3). Each dataset is clustered multiple times with each clustering method to obtain different numbers of clusters (Table A in S1 Text).
The t-SNE based clustering methods are faster than SC3 (Fig 3A). Also, the t-SNE based clustering methods have similar runtimes (Fig 3A), since the time limiting step is the computation of t-SNE projection, when optimal hierarchical clustering ordering is not required in MIRAC. When the optimal ordering is required, the time limiting step is the computation of optimal ordering (Fig 1). With regard to practical usage, we recommend the users to explore different parameters of MIRAC without optimal ordering. Once appropriate parameters have been identified, users can perform MIRAC with optimal ordering. However, for datasets with more than 10,000 cells, we recommend that users perform MIRAC without optimal ordering (S2 Fig panel C and S3 Fig panel C). For datasets with more than 100,000 cells, we recommend that users perform community-detection-extended MIRAC without optimal ordering (S8C Fig), which takes for instance 2885.5 ± 176.0 (mean ± standard deviation) seconds on the mouse organogenesis cell atlas (MOCA) dataset with 2,058,652 single cells using 25 CPU cores (S8C Fig) . We also performed clustering with the community clustering methods implemented in scedar, Seurat  and Scanpy  on relatively large scRNA-seq datasets with more than 10,000 cells (S8C Fig). The runtime of community clustering methods Seurat and Scanpy is comparable to community clustering and community-detection-extended MIRAC (S8C Fig), except that Scanpy and Seurat were not able to cluster the MOCA dataset that contain 2,058,652 single cells on a server with 1TB memory due to a mandatory conversion of the sparse read count matrix into dense matrix.
The cluster consistent ratios (CCRs) of t-SNE based clustering methods are comparable to SC3 (Fig 3B). The PCA based community clustering results showed relatively lower CCRs than other clustering methods (Fig 3B), which might due to the non-optimal representation of similarities between high-dimensional single cell read counts in the linear PCA space, comparing to t-SNE embeddings that are optimized to capture the similarities between high-dimensional single cell read counts. The representative MIRAC clustering results of Zheng et al .  and Macosko et al .  datasets are visualized with t-SNE scatter plots and pairwise distance matrix heatmaps (S2 Fig panel C and S3 Fig panel C). For smaller datasets, the representative MIRAC results are shown (S6 Fig). Although t-SNE projections obtained with different random states are distinct from each other, the consistency of clustering results is comparable to SC3 (S1 Fig).
We benchmarked several gene dropout imputation methods on the simulated 10x Genomics (Fig 4) and Drop-seq (S7 Fig) datasets. K-nearest neighbor gene dropout imputation (KNNGDI) is faster than other compared methods on relatively small datasets (Fig 4A) and is capable of handling scRNA-seq datasets with over 10,000 cells (S8 Fig panel A). The runtime of KNNDGI on 2,058,652 cells is 71.56 ± 1.71 (mean ± standard deviation) hours.
The performance of KNNGDI on detecting gene dropouts is comparable to SAVER and better than scImpute and MAGIC. The ROC curve of scImpute sharply turns around TPR = 0.25 (Fig 4B and S7 Fig panel B) because its threshold parameters, determining whether a zero entry is a dropout or not, are not sensitive enough to achieve any higher TPRs. Although the AUCs of KNNGDI and SAVER are higher than scImpute and MAGIC, they all have comparable performances when FPRs are lower than 0.05 (Fig 4B), except that MAGIC has worse performance on the simulated Drop-seq datasets that have higher sparsity than the Zeisel et al .  dataset (S7 Fig panel B).
The MSEs of KNNGDI on correcting gene dropouts are comparable to SAVER and smaller than scImpute and MAGIC (S7 Fig panel C and S7 Fig panel D). However, these methods all have MSEs many times higher than the MSEs of the observed counts to the true counts, which implies that these methods all introduced many times more reads than the true dropout reads. Despite different levels of MSEs, the t-SNE embeddings of the imputed read counts are consistent with the t-SNE embeddings of original read counts (Fig 4C, and all panels E in S2 Fig, S3 Fig, S4 Fig and S5 Fig).
We detected rare transcriptomic profiles in datasets listed in Table 1 with the KNN method (Fig 5, and all panels F in S2 Fig, S3 Fig, S4 Fig, S5 Fig and S6 Fig) Although the detection method has many limitations, rare transcriptomic profiles tend to be points on the t-SNE scatter plots either far away from the majority of their same types or along the edges of a group of agglomerated points. On the pairwise distance matrix heatmap, rare transcriptomic profiles tend to be small chunks that are distinct from their neighbors, and the heatmap becomes smoother after removing the rare transcriptomic profiles. The detected rare transcriptomic profiles could be further inspected as potential rare cell types and states by comparing with their nearest neighbors. The KNN rare transcriptomic profile detection method is also capable of handling scRNA-seq datasets with over 10,000 cells (S8 Fig panel B), with a runtime of 22.14 ± 0.87 (mean ± standard deviation) minutes on 2,058,652 cells.
We used scedar to identify the genes distinguishing the MIRAC cluster 1, 15, and 22 of the Zeisel et al .  dataset (Fig 6 and Table B in S1 Text). In the original publication, the upper to lower MIRAC clusters labeled 22, 1, and 15 in the t-SNE scatter plot are assigned to microglia, endothelial cells, and astrocytes respectively (Fig 4C). We choose these three clusters to inspect one of the discrepancies between cell types and MIRAC clustering results, where the small isolated upper part of the MIRAC cluster 15 is assigned to microglia instead of endothelial cells in the original publication.
The smaller upper isolated part of MIRAC cluster 15 might be a distinct cell sub-type of microglia. Although it expresses a microglia marker gene C1qb (Fig 6B) , it does not express Mrc1 or Apoe in the same pattern as the MIRAC cluster 22 (Fig 6B and 6C). According to the transcription levels of the cluster separating genes (Fig 6C), the smaller upper isolated part of MIRAC cluster 15, which is located at the top of the cluster 15 rows, has some genes expressed in the same pattern as the microglia, but it also has some other genes expressed distinctly from the microglia.
Comprehensive profiling of the transcriptomes of individual cells within an organism or tissue by scRNA-seq is able to facilitate the systematic study of physiological or pathological states [3,61,62]. Previous scRNA-seq experiments identified novel cell types or states , obtained insights into the regulation of differentiation process [3,59,63,64], and inferred molecular mechanisms of tissue functions [53,65,66].
The biological results of scRNA-seq experiments are obtained from extensive data analyses, which could take more time than doing the experiments. As the size of scRNA-seq datasets increases rapidly [1,67], computational methods also need to be scalable by exploiting hardware and software techniques for big data analytics, in order to complete an analysis within a reasonable amount of time.
Scedar is able to facilitate gene dropout imputation, rare transcriptomic profile detection, clustering, and identification of cluster separating genes for scRNA-seq data by exploiting scalable system design patterns and high-performance computing architectures. We parallelized time-consuming computational procedures by multi-processing without excessive copies of shared data in memory. We also decompose the whole analytical procedure into multiple steps, so that certain steps could be specifically optimized without repeating others. In addition, intermediate results, such as pairwise distances and t-SNE projections, are lazy loaded and cached in a unified data structure to speed up analytical routines, prevent repeated computations, and alleviate the burden on users to keep track of all intermediate results.
Comparing to other computational tools that were developed or updated for large scale scRNA-seq data analysis, like PAGODA2 , Seurat v2.0 , and Scanpy , scedar distinguishes itself with an additional research focus on developing new analytical methods, including those based on machine learning. In scedar, we adapted KNN, a typical machine learning algorithm, to impute gene dropouts and detect rare transcriptomic profiles. MDL principle, an important concept in computational learning, is applied to cluster single cells . A scalable and sparsity-aware gradient boosted tree system XGBoost , which implements a typical machine learning algorithm, is used to identify genes that are able to distinguish different clusters. In addition, these machine learning methods are able to exploit modern high-performance computing architecture, which improves the scalability of the package.
In scedar, we developed a clustering algorithm, MIRAC, for scRNA-seq data. MIRAC clusters observations in three steps: 1) build a tree by hierarchical clustering, 2) divide the tree into sub-trees, and 3) merge similar sub-trees into individual clusters. This clustering strategy adapts the basic ideas of BIRCH  and BackSPIN . Instead of building a balanced tree structure, like the clustering feature tree in BIRCH for further partitioning, MIRAC divides the tree structure built by hierarchical clustering optionally in a balanced manner (S10 Fig and S1 Text section 1.4), which simplifies the clustering procedure. In BackSPIN, a sorted pairwise correlation matrix is recursively bi-partitioned into clusters according to criteria based on the normalized sums of correlation coefficients. In contrast, MIRAC iteratively merges the relatively small sub-clusters according to criteria based on the MDL principle, which increases the robustness for determining whether two groups of observations should be put in the same cluster or not. Especially, when the number of observations is large within a group, finding an optimal bi-partition is not straightforward, since there may be multiple distinct sub-groups. Although the performance of MIRAC under certain metrics is comparable to other clustering algorithms, it is able to provide distinct clusters that are sensitive to local structures, which could be used as alternative perspectives to interpret the source of heterogeneity within the dataset.
There are still many possible improvements on scedar. To improve the scalability of MIRAC, we could provide more efficient methods to obtain the optimal leaf ordering using linear embedding techniques . To improve the scalability of the backend data structure, we could extend it with distributed analytic systems such as Apache Spark . To visualize the differences between different clusters, we could plot the fold changes of gene transcription levels on the pathway maps in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database .
Scedar is at its core developed as an open-source, community-extensible package for exploratory data analysis, with strong testing and code development standards (Table C in S1 Text). Scedar will continue to be developed in an open-source environment that can offer an analytical platform for developing novel algorithms, support state-of-the-art software quality control standards, and provide easy-to-use programming interfaces for exploratory data research.