--- title: | "ScRNA-seq workflow CONCLUS: from CONsensus CLUSters to a meaningful CONCLUSion" author: | "Ilyess Rachedi, Polina Pavlovich, Nicolas Descostes, and Christophe Lancrin" output: BiocStyle::pdf_document: default vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{conclus} --- # Introduction CONCLUS is a tool for robust clustering and positive marker features selection of single-cell RNA-seq (sc-RNA-seq) datasets. It is designed to identify rare cells populations by consensus clustering. Of note, CONCLUS does not cover the preprocessing steps of sequencing files obtained following next-generation sequencing. You can find a good resource to start with [here](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html). CONCLUS is organized into the following steps: + Generation of multiple t-SNE plots with a range of parameters including different selection of genes extracted from PCA. + Use the Density-based spatial clustering of applications with noise (DBSCAN) algorithm for idenfication of clusters in each generated t-SNE plot. + All DBSCAN results are combined into a cell similarity matrix. + The cell similarity matrix is used to define "CONSENSUS" clusters conserved accross the previously defined clustering solutions. + Identify marker genes for each concensus cluster. # Getting help Issues can be submitted directly to the Bioconductor forum using the keyword 'conclus' in the post title. To contact us directly write to christophe.lancrin@embl.it or rachedi.ilyess@gmail.com. The principles of this package were originally developed by Polina Pavlovich who is now doing her Ph.D at the Max Planck Institute of Immunobiology and Epigenetics. # Important note Due to the stochastic aspect of tSNE, everything was generated with a random seed of 42 (default parameter of `generateTSNECoordinates`) to ensure reproducibility of the results. Because of the evolution of biomaRt, you might get slightly different figures. However, the marker genes for each cluster should be the same. The package is currently limited to mouse and human. Other organisms can be added on demand. Write to christophe.lancrin@embl.it or rachedi.ilyess@gmail.com.Priority will be given to model organisms. # Standard workflow ## Quick start CONCLUS requires to start with a raw-count matrix with reads or unique molecular identifiers (UMIs). The columns of the count matrix must contain cells and the rows -- genes. CONCLUS needs a large number of cells to collect statistics, we recommend using CONCLUS if you have at least 100 cells. In the example below, a small toy example is used to illustrate the runCONCLUS method. Real data are used later in this vignette. ```{r runConclus_first, eval=FALSE} library(conclus) outputDirectory <- tempdir() experimentName <- "Test" species <- "mouse" countmatrixPath <- system.file("extdata/countMatrix.tsv", package="conclus") countMatrix <- loadDataOrMatrix(file=countmatrixPath, type="countMatrix", ignoreCellNumber=TRUE) coldataPath <- system.file("extdata/colData.tsv", package="conclus") columnsMetaData <- loadDataOrMatrix(file=coldataPath, type="coldata", columnID="cell_ID") sceObjectCONCLUS <- runCONCLUS(outputDirectory, experimentName, countMatrix, species, columnsMetaData=columnsMetaData, perplexities=c(2,3), tSNENb=1, PCs=c(4,5,6,7,8,9,10), epsilon=c(380, 390, 400), minPoints=c(2,3), clusterNumber=4) ``` In your "outputDirectory", the sub-folder `pictures` contains all tSNE with dbscan coloration (sub-folder `tSNE_pictures`), the cell similarity matrix ( `Test_cells_correlation_X_clusters.pdf`), the cell heatmap (`Test_clustersX_meanCenteredTRUE_orderClustersFALSE_orderGenesFALSE markersPerCluster.pdf`), and the cluster similarity matrix (`Test_clusters_similarity_10_clusters.pdf`). You will also find in the sub-folder `Results`: + `1_MatrixInfo`: The normalized count matrix and its meta-data for both rows and columns. + `2_TSNECoordinates`: The tSNE coordinates for each parameter of principal components (PCs) and perplexities. + `3_dbScan`: The different clusters given by DBscan according to different parameters. Each file gives a cluster number for each cell. + `4_CellSimilarityMatrix`: The matrix underlying the cells similarity heatmap. + `5_ClusterSimilarityMatrix`: The matrix underlying the clusters similarity heatmap. + `6_ConclusResult`: A table containing the result of the consensus clustering. This table contains two columns: clusters-cells. + `7_fullMarkers`: Files containing markers for each cluster, defined by the consensus clustering. + `8_TopMarkers`: Files containing the top 10 markers for each cluster. + `9_genesInfos`: Files containing gene information for the top markers defined in the previous folder. Further details about how all these results are generated can be found below. ## Data In this vignette, we demonstrate how to use CONCLUS on a sc-RNA-seq dataset from [Shvartsman *et al*](https://www.biorxiv.org/content/10.1101/774547v1). The Yolk Sac (YS) and Aorta-Gonad-Mesonephros (AGM) are two major haematopoietic regions during embryonic development. Interestingly, AGM is the only one generating haematopoietic stem cells (HSCs). To identify the difference between AGM and YS, Shvartsman *et al* compared them using single cell RNA sequencing between 9.5 and 11.5 days of mouse embryonic development. This vignette aims at reproducing the identification of a rare population corresponding to liver like cells in the YS at day 11.5. The number of clusters used in this vignette is lower than in the original article for sake of simplification. This vignette neither provides a description of the differences between the AGM and YS. Please refer to Shvartsman *et al* for a complete description. Since endothelial cells represent a small population in these tissues, cells expressing the endothelial marker VE-Cadherin (VE-Cad, also called Cdh5) were enriched and sorted, VE-Cad Negative cells constituting the microenvironment. Therefore, four cell states are defined by the cell barcodes: YS VE-Cad minus, AGM VE-Cad minus, YS VE-Cad minus, and AGM VE-Cad minus. The E11.5 data are constituted of 1482 cells. After filtering and normalization, 1303 cells were retained. This sc-RNA-seq experiment was performed using the SMARTer ICELL8 Single-Cell System ([Click here for more info](https://www.takarabio.com/learning-centers/automation-systems/icell8-introduction/icell8-technology-overview)). The protocol was based on 3' end RNA sequencing where each mRNA molecule is labeled with a unique molecular identifier (UMI) during reverse transcription in every single cell. Shvartsman *et al* deleted highly abundant haemoglobins having names starting with ‘Hba’ or ‘Hbb’ because they seemed to be the primary source of contamination. Additionally, they excluded poorly annotated genes that did not have gene symbols to improve the clusters annotation process. Finally, the human cell controls were removed. ### Retrieving data from GEO The code below format the count matrix and the columns meta-data. The source data are downloaded from the GEO page [GSE161933](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4923493). The URL of the count matrix and the cells meta-data were retrieved by right-click and 'copy link adress' on the http hyperlink of the supplementary file at the bottom of the page. If the columns metadata are not available in the supplementary file section, the name of the series matrix, containing columns meta-data can be retrieved by clicking the 'Series Matrix File(s)' link just above the count matrix. The function `retrieveFromGEO` used with the parameter seriesMatrixName will download all series matrices present in the GEO record however only the one of interest will be kept. ```{r format_data} library(conclus) outputDirectory <- tempdir() dir.create(outputDirectory, showWarnings=FALSE) species <- "mouse" countMatrixPath <- file.path(outputDirectory, "countmatrix.txt") metaDataPath <- file.path(outputDirectory, "metaData.txt") matrixURL <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM492", "3493&format=file&file=GSM4923493%5FE11%5F5%5Frawcounts%5Fmatrix%2Etsv%2Egz") metaDataURL <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=", "GSM4923493&format=file&file=GSM4923493%5FMetadata%5FE11%5F5%2Etxt%2Egz") result <- retrieveFromGEO(matrixURL, countMatrixPath, species, metaDataPath=metaDataPath, colMetaDataURL=metaDataURL) countMatrix <- result[[1]] columnsMetaData <- result[[2]] ## Correct the columns names to fit conclus input requirement ## The columns should be: cellName, state, and cellBarcode columnsMetaData <- columnsMetaData[,c(1,3,2)] colnames(columnsMetaData) <- c("cellName", "state", "cellBarcode") ## Removing embryonic hemoglobins with names starting with "Hba" or "Hbb" idxHba <- grep("Hba", rownames(countMatrix)) idxHbb <- grep("Hbb", rownames(countMatrix)) countMatrix <- countMatrix[-c(idxHba, idxHbb),] ## Removing control human cells idxHumanPos <- which(columnsMetaData$state == "Pos Ctrl") idxHumanNeg <- which(columnsMetaData$state == "Neg Ctrl") columnsMetaData <- columnsMetaData[-c(idxHumanPos, idxHumanNeg),] countMatrix <- countMatrix[,-c(idxHumanPos, idxHumanNeg)] ## Removing genes not having an official symbol idxENS <- grep("ENSMUSG", rownames(countMatrix)) countMatrix <- countMatrix[-idxENS,] ``` ### Using local data If you executed the code above, a matrix path is stored in `countMatrixPath` and a file path to the meta-data is stored in `metaDataPath`. Just indicate the paths to your files in these two variables. Please also note that to load the metadata, you need to fill in the `columnID` argument with the name of the column containing the names of the cells. These names must be the same as those in the count matrix. ```{r local_data, eval=FALSE} ## countMatrixPath <- "" ## metaDataPath <- "" countMatrix <- loadDataOrMatrix(countMatrixPath, type="countMatrix") columnsMetaData <- loadDataOrMatrix(file=metaDataPath, type="coldata", columnID="") ## Filtering steps to add here as performed above ``` ## Test clustering The *TestClustering* function runs one clustering round out of the 84 (default) rounds that CONCLUS normally performs. This step can be useful to determine if the default DBSCAN parameters are suitable for your dataset. By default, they are *dbscanEpsilon = c(1.3, 1.4, 1.5)* and *minPts = c(3,4)*. If the dashed horizontal line in the k-NN distance plot lays on the “knee” of the curve (as shown below), it means that optimal epsilon is equal to the intersection of the line to the y-axis. In our example, optimal epsilon is 1.4 for 5-NN distance where 5 corresponds to MinPts. In the "test_clustering" folder under outputDirectory, the three plots below will be saved where one corresponds to the “distance_graph.pdf”, another one to "test_tSNE.pdf" (*p[[1]]*), and the last one will be saved as "test_clustering.pdf" (*p[[2]]*). ```{r normalization_1} ## Creation of the single-cell RNA-Seq object scr <- singlecellRNAseq(experimentName = "E11_5", countMatrix = countMatrix, species = "mouse", outputDirectory = outputDirectory) ## Normalization of the count matrix scr <- normaliseCountMatrix(scr, coldata=columnsMetaData) ``` ```{r testClustering, results="hide", eval=FALSE} p <- testClustering(scr, writeOutput=TRUE, silent=TRUE) ``` ![If the dashed horizontal line in the k-NN distance plot lays on the “knee” of the curve, it means that optimal epsilon is equal to the intersection of the line to the y-axis.](figures/1_distancegraph.png){width="344px" height="350px"} ```{r testClustering_result1, results="hide", fig.keep = 'none', eval=FALSE} # saved as "outputDirectory/test_clustering/test_tSNE.pdf" p[[1]] ``` ![One of the 14 tSNE (by default) generated by conclus.](figures/2_tsne.png){width="369px" height="366px"} ```{r testClustering_result2, results="hide", fig.keep = 'none', eval=FALSE} # saved as "outputDirectory/test_clustering/test_clustering.pdf" p[[2]] ``` ![One of the 84 dbscan clustering solutions generated by conclus.](figures/3_clusterplot.png){width="369px" height="366px"} # CONCLUS step by step The wrapper function runCONCLUS is organized into 7 steps: + Normalization of the counts matrix + Generation of t-SNE coordinates + Clustering with DB-SCAN + Cell and cluster similarity matrix calculation + Plotting + Marker genes identification + Results export ## Normalization of the counts matrix sc-RNA-seq datasets are quite challenging notably because of sparsity (many genes are not detected consistently yielding expression matrices with many zeroes) and also because of technical noise. To facilitate analysis, one needs to perform a step of normalization which allows for the correction of unwanted technical and biological noises (click [here]( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5549838/) for a complete review on normalization techniques). CONCLUS uses [Scran](https://bioconductor.org/packages/release/bioc/html/scran.html) and [Scater](http://www.bioconductor.org/packages/release/bioc/html/scater.html) packages for normalization. Beforehand, the function will annotate genes creating *rowData* and add statistics about cells into *columnsMetaData*. If you already have *columnsMetaData* and *rowData*, you can give it to the function (see manual). It will keep your columns and add new ones at the end. If you do not want to lose any cell after quality metrics check, select *alreadyCellFiltered = TRUE*, by default it is *FALSE*. Before *scran* and *scater* normalization, the function will call *scran::quickCluster* (see manual for details). If you want to skip this step, set *runQuickCluster = FALSE*, by default it is *TRUE*. We advice to first try the analysis with this option and to set it to FALSE if no rare populations are found. ```{r normalization_2, eval=FALSE} scr <- normaliseCountMatrix(scr, coldata=columnsMetaData) ``` The method *normaliseCountMatrix* returns a scRNASeq object with its sceNorm slot updated. This slot contains a SingleCellExperiment object having the normalized count matrix, the colData (table with cells informations), and the rowData (table with the genes informations). See ?SingleCellExperiment for more details. The rowdata can help to study cross-talk between cell types or find surface protein-coding marker genes suitable for flow cytometry. The columns with the GO terms are *go_id* and *name_1006* (see manual). The slots can be accessed as indicated below: ```{r normalization_results} ## Accessing slots originalMat <- getCountMatrix(scr) SCEobject <- getSceNorm(scr) normMat <- SingleCellExperiment::logcounts(SCEobject) # checking what changed after the normalisation dim(originalMat) dim(normMat) # show first columns and rows of the count matrix originalMat[1:5,1:5] # show first columns and rows of the normalized count matrix normMat[1:5,1:5] # visualize first rows of metadata (coldata) coldataSCE <- as.data.frame(SummarizedExperiment::colData(SCEobject)) head(coldataSCE) # visualize beginning of the rowdata containing gene information rowdataSCE <- as.data.frame(SummarizedExperiment:::rowData(SCEobject)) head(rowdataSCE) ``` ## Generation of t-SNE coordinates *runCONCLUS* generates an object of fourteen (by default) tables with tSNE coordinates. Fourteen because it will vary seven values of principal components *PCs=c(4, 6, 8, 10, 20, 40, 50)* and two values of perplexity *perplexities=c(30, 40)* in all possible combinations. The chosen values of PCs and perplexities can be changed if necessary. We found that this combination works well for sc-RNA-seq datasets with 400-2000 cells. If you have 4000-9000 cells and expect more than 15 clusters, we recommend to use more first PCs and higher perplexity, for example, *PCs=c(8, 10, 20, 40, 50, 80, 100)* and *perplexities=c(200, 240)*. For details about perplexities parameter see ‘?Rtsne’. ```{r tsne_generation} scr <- generateTSNECoordinates(scr, cores=2) ``` Results can be explored as follows: ```{r tsne_showResults} tsneList <- getTSNEList(scr) head(getCoordinates(tsneList[[1]])) ``` ## Clustering with DB-SCAN Following the calculation of t-SNE coordinates, DBSCAN is run with a range of epsilon and MinPoints values which will yield a total of 84 clustering solutions (PCs x perplexities x MinPoints x epsilon). *minPoints* is the minimum cluster size which you assume to be meaningful for your experiment and *epsilon* is the radius around the cell where the algorithm will try to find *minPoints* dots. Optimal *epsilon* must lay on the knee of the k-NN function as shown in the "test_clustering/distance_graph.pdf" (See Test clustering section above). ```{r runDBSScan} scr <- runDBSCAN(scr, cores=2) ``` Results can be explored as follows: ```{r dbscan_showResults} dbscanList <- getDbscanList(scr) clusteringList <- lapply(dbscanList, getClustering) clusteringList[[1]][, 1:10] ``` ## Cell and cluster similarity matrix calculation The above calculated results are combined together in a matrix called “cell similarity matrix”. *runDBSCAN* function returns an object of class scRNASeq with its dbscanList slot updated. The list represents 84 clustering solutions (which is equal to number of PCs x perplexities x MinPoints x epsilon). Since the range of cluster varies from result to result, there is no exact match between numbers in different elements of the list. Cells having the same number within an element are guaranteed to be in one cluster. We can calculate how many times out of 84 clustering solutions, every two cells were in one cluster and that is how we come to the similarity matrix of cells. We want to underline that a zero in the dbscan results means that a cell was not assigned to any cluster. Hence, cluster numbers start from one. *clusterCellsInternal* is a general method that returns an object of class scRNASeq with its cellsSimilarityMatrix slot updated. Note that the number of clusters is set to 10 (it was set to 11 in the original study, see data section). Considering that clusters can be merged afterwards (see Supervised clustering section), we advise to keep this number. One might want to increase it if the cell similarity matrix or the tSNE strongly suggest more clusters but this number is suitable for most experiments in our hands. ```{r clusterCellsInternal} scr <- clusterCellsInternal(scr, clusterNumber=10) ``` ```{r clusterCellsInternal_showResults} cci <- getCellsSimilarityMatrix(scr) cci[1:10, 1:10] ``` After looking at the similarity between elements on the single-cell level, which is useful if we want to understand if there is any substructure which we did not highlight with our clustering, a "bulk" level where we pool all cells from a cluster into a representative "pseudo cell" can also be generated. This gives a *clusterSimilarityMatrix*: ```{r clustersSimilarityMatrix} scr <- calculateClustersSimilarity(scr) csm <- getClustersSimilarityMatrix(scr) csm[1:10, 1:10] ``` ## Plotting ### t-SNE colored by clusters or conditions CONCLUS generated 14 tSNE combining different values of PCs and perplexities. Each tSNE can be visualized either using coloring reflecting the results of DBScan clustering, the conditions (if the metadata contain a 'state' column) or without colors. Here *plotClusteredTSNE* is used to generate all these possibilities of visualization. ```{r plotClustered, eval=FALSE} tSNEclusters <- plotClusteredTSNE(scr, columnName="clusters", returnPlot=TRUE, silentPlot=TRUE) tSNEnoColor <- plotClusteredTSNE(scr, columnName="noColor", returnPlot=TRUE, silentPlot=TRUE) tSNEstate <- plotClusteredTSNE(scr, columnName="state", returnPlot=TRUE, silentPlot=TRUE) ``` For visualizing the 5th (out of 14) tSNE cluster: ```{r plotClustered_visualization1, results="hide", fig.keep = 'none', eval=FALSE} tSNEclusters[[5]] ``` ![DBscan results on the 5th tSNE.](figures/4_tSNEclusters.png){width="357px" height="359px"} This tSNE suggests that the cluster 8 corresponds to a rare population of cells. We can also appreciate that clusters 1 and 2, as clusters 9 and 10 could be considered as single clusters respectively. For visualizing the 5th (out of 14) tSNE cluster without colors: ```{r plotClustered_visualization2, results="hide", fig.keep = 'none', eval=FALSE} tSNEnoColor[[5]] ``` ![The 5th tSNE solution without coloring.](figures/5_tSNEnoColor.png){width="364px" height="365px"} For visualizing the 5th (out of 14) tSNE cluster colored by state: ```{r plotClustered_visualization3, results="hide", fig.keep = 'none', eval=FALSE} tSNEstate[[5]] ``` ![The 5th tSNE solution colored by cell condition.](figures/6_tSNEstate.png){width="355px" height="362px"} One can see that the cluster 8 contains only cells of the YS. \newpage ### Cell similarity heatmap The *cellsSimilarityMatrix* is then used to generate a heatmap summarizing the results of the clustering and to show how stable the cell clusters are across the 84 solutions. ```{r plotCellSimilarity, results="hide", fig.keep = 'none', eval=FALSE} plotCellSimilarity(scr) ``` ![Cell similarity matrix showing the conservation of clustering across the 84 solutions.](figures/7_cellSimilarity.png){width="459px" height="465px"} *CellsSimilarityMatrix* is symmetrical and its size proportional to the "number of cells x number of cells". Each vertical or horizontal tiny strip is a cell. Intersection shows the proportion of clustering iterations in which a pair of cells was in one cluster (score between 0 and 1, between blue and red). We will call this combination "consensus clusters" and use them everywhere later. We can appreciate that cellsSimilarityMatrix is the first evidence showing that CONCLUS managed not only to distinguish VE-Cad plus cells from the VE-Cad minus but also find sub-populations within these groups. ### Cluster similarity heatmap ```{r plotClustersSimilarity, results="hide", fig.keep = 'none', eval=FALSE} plotClustersSimilarity(scr) ``` ![Cluster similarity matrix. Each cell belonging to a particular cluster were merged into a pseudo-cell.](figures/8_clustersim.png){width="459px" height="460px"} In the *clusterSimilarityMatrix*, we can see that all clusters have a high value of similarity across all clustering solutions. Red color on the diagonal means that the group is homogenous, and usually, it is what we want to get. The yellow on the diagonal indicates that either that group consists of two or more equal sized subgroups. The orange on the diagonal suggests that groups can be merged. Bluish color points to a cluster of dbscan "outliers" that usually surrounds dense clouds of cells in t-SNE plots. As observed previously on the tSNE, clusters 1 and 2 as 9 and 10 are very similar. Clusters 3 and 8 are very different. ## Marker genes identification To understand the nature of the consensus clusters identified by CONCLUS, it is essential to identify genes which could be classified as marker genes for each cluster. To this aim, each gene should be "associated" to a particular cluster. This association is performed by looking at up-regulated genes in a particular cluster compared to the others (multiple comparisons). The method *rankGenes* performs multiple comparisons of all genes from the object and rank them according to a score reflecting a FDR power. In summary, the method *conclus::rankGenes()* gives a list of marker genes for each cluster, ordered by their significance. See ?rankGenes for more details. ```{r rankGenes} scr <- rankGenes(scr) ``` ```{r rankGenes_result} markers <- getMarkerGenesList(scr, cluster=1) ``` The top 10 markers by cluster (default) can be selected with: ```{r topMarkers_result} scr <- retrieveTopClustersMarkers(scr, removeDuplicates=TRUE) topMarkers <- getTopMarkers(scr) topMarkers ``` Cluster 8 contains the marker genes A2m, Cldn1, Maob, Pcbd1, Pdzk1, Krt20, Klb, Serpind1, and F2. Alpha-2-Macroglobulin (A2m) is a large (720 KDa) plasma protein found in the blood and is mainly produced by the liver; Claudins (CLDNs) are a family of integral membrane proteins, Cldn1 being expressed by epithelial liver cells; The monoamine oxidase (Maob) is found in abundance in the liver. All these markers play a role in the liver. This sub-population is hence called liver-like hereafter. # Plot a heatmap with positive marker genes Following the execution of the `retrieveTopClustersMarkers` method, CONCLUS offers the option to visualize the marker genes on a heatmap. Below we chose to show the selected 10 marker genes per cluster which should generate a heatmap with 100 genes (10 marker genes x 10 clusters). This is convenient for visualization. In practice, the number of genes in this heatmap will be less than 100 because some genes were classified as markers for more than one cluster. This can happen when several clusters correspond to similar cellular types. After selecting the top markers with the method `retrieveTopClustersMarkers`, the method *plotCellHeatmap* is used to order clusters and genes by similarity (the same order as in the *clusterSimilarityMatrix*) and show mean-centered normalized data. Mean-centering allows seeing the relative expression of a gene compared to the mean. ```{r plotCellHeatmap_marker1, results="hide", fig.keep = 'none', eval=FALSE} plotCellHeatmap(scr, orderClusters=TRUE, orderGenes=TRUE) ``` ![Heatmap showing the expression of the top 10 markers for each cluster. The values are normalized according to the mean.](figures/9_cellHeatmap.png) One can also visualize the heatmap without mean centering and appreciate the importance of the normalization of colors by row. Indeed, the different markers are much harder to identify. ```{r plotCellHeatmap_marker2, results="hide", fig.keep = 'none', eval=FALSE} plotCellHeatmap(scr, orderClusters=TRUE, orderGenes=TRUE, meanCentered=FALSE) ``` ![Same heatmap as before without normalizing by the mean.](figures/10_cellHeatmapNoMean.png) Alternative order of clusters is by name or by hierarchical clustering as in the default pheatmap function. # Plot t-SNE colored by expression of a selected gene *PlotGeneExpression* allows visualizing the normalized expression of one gene in a t-SNE plot. It can be useful to inspect the specificity of top markers. Cluster 8 was identified as being a liver-like rare population. Below are examples of the expression of its marker genes: ```{r plotGeneExpression1, eval=FALSE} plotGeneExpression(scr, "Maob", tSNEpicture=5) ``` ![Maob Marker gene of cluster 8.](figures/11_Maob_cluster8.png){width="363px" height="362px"} ```{r plotGeneExpression2, eval=FALSE} plotGeneExpression(scr, "Pcbd1", tSNEpicture=5) ``` ![Pcbd1 Marker gene of cluster 8.](figures/12_Pcbd1_cluster8.png){width="364px" height="366px"} ```{r plotGeneExpression3, eval=FALSE} plotGeneExpression(scr, "Serpinf2", tSNEpicture=5) ``` ![Serpinf2 Marker gene of cluster 8.](figures/13_Serpinf2_cluster8.png){width="360px" height="363px"} ```{r plotGeneExpression4, eval=FALSE} plotGeneExpression(scr, "Cldn1", tSNEpicture=5) ``` ![Cldn1 Marker gene of cluster 8.](figures/14_Cldn1_cluster8.png){width="359px" height="362px"} # Collect publicly available info about marker genes ## Collect information for the top 10 markers for each cluster `retrieveGenesInfo` retrieves gene information from NCBI, MGI, and UniProt. It requires the `retrieveTopMarkers` method to have been run on the object. ```{r getGenesInfo} scr <- retrieveGenesInfo(scr) result <- getGenesInfos(scr) head(result) ``` `result` contains the following columns: + uniprot_gn_symbol: Uniprot gene symbol. + clusters: The cluster to which the gene is associated. + external_gene_name: The complete gene name. + go_id: Gene Ontology (GO) identification number. + mgi_description: If the species is mouse, description of the gene on MGI. + entrezgene_description: Description of the gene by the Entrez database. + gene_biotype: protein coding gene, lincRNA gene, miRNA gene, unclassified non-coding RNA gene, or pseudogene. + chromosome_name: The chromosome on which the gene is located. + Symbol: Official gene symbol. + ensembl_gene_id: ID of the gene in the ensembl database. + mgi_id: If the species is mouse, ID of the gene on the MGI database. + entrezgene_id: ID of the gene on the entrez database. + uniprot_gn_id: ID of the gene on the uniprot database. # Supervised clustering Until now, we have been using CONCLUS in an unsupervised fashion. This is a good way to start the analysis of a sc-RNA-seq dataset. However, the knowledge of the biologist remains a crucial asset to get the maximum of the data. This is why we have included in CONCLUS, additional options to do supervised analysis (or “manual” clustering) to allow the researcher to use her/his biological knowledge in the CONCLUS workflow. Going back to the example of the Shvartsman *et al.* dataset above (cluster similarity heatmap), one can see that some clusters clearly belong to the same family of cells after examining the clusters_similarity matrix generated by CONCLUS. As previously mentioned, clusters 1 and 2 as 9 and 10 are very similar. In order to figure out what marker genes are defining these families of clusters, one can use manual clustering in CONCLUS to fuse clusters of similar nature: i.e. combine clusters 1 and 2 (9 and 10) together. ```{r merging_clusters, eval=FALSE} ## Retrieving the table indicating to which cluster each cell belongs clustCellsDf <- retrieveTableClustersCells(scr) ## Replace "2/10" by "1/9" to merge 1/2 and 9/10 clustCellsDf$clusters[which(clustCellsDf$clusters == 2)] <- 1 clustCellsDf$clusters[which(clustCellsDf$clusters == 10)] <- 9 ## Modifying the object to take into account the new classification scrUpdated <- addClustering(scr, clusToAdd=clustCellsDf) ``` Now we can visualize the new results taking into account the new classification: ```{r addClusteringManually_visualization2, results="hide", fig.keep = 'none', eval=FALSE} plotCellSimilarity(scrUpdated) ``` ![Updated cells similarity matrix with merged 1/2 and 9/10.](figures/15_updatedCellSim.png){width="459px" height="463px"} ```{r addClusteringManually_visualization3, results="hide", fig.keep = 'none', eval=FALSE} plotCellHeatmap(scrUpdated, orderClusters=TRUE, orderGenes=TRUE) ``` ![Updated cells heatmap with merged 1/2 and 9/10.](figures/16_updatedCellHeatmap.png) ```{r addClusteringManually_visualization4, results="hide", fig.keep = 'none', eval=FALSE} tSNEclusters <- plotClusteredTSNE(scrUpdated, columnName="clusters", returnPlot=TRUE, silentPlot=TRUE) tSNEclusters[[5]] ``` ![5th tSNE solution colored by dbscan result showing the merged clusters.](figures/17_updatedTSNE.png){width="360px" height="362px"} The cell heatmap above shows that Col1a1 is a good marker of cluster 1 and that Cdk8 is a good marker of cluster 9 (at the bottom). One can visualize them in the t-SNE plots below. ```{r replot1, eval=FALSE} plotGeneExpression(scrUpdated, "Col1a1", tSNEpicture=5) ``` ![Col1a1 Marker gene for cluster 1 (mix of old clusters 1 and 2)](figures/18_Col1a1_cluster2.png){width="359px" height="364px"} ```{r replot2, eval=FALSE} plotGeneExpression(scrUpdated, "Cdk8", tSNEpicture=5) ``` ![Cdk8 Marker gene for clusters 9 (mix of old clusters 9 and 10)](figures/19_Cdk8_cluster9.png){width="361px" height="363px"} # Conclusion Here we demonstrated how to use CONCLUS and combine multiple parameters testing for sc-RNA-seq analysis. It allowed us to identify a rare population in the data of Shvartsman *et al* and will help gaining deeper insights into others. # Session info ```{r sessionInfo} sessionInfo() ```