--- title: "Introduction to the CHETAH package" author: "Jurrian de Kanter" date: "`r Sys.Date()`" output: html_document: number_sections: TRUE toc: TRUE theme: united includes: before_body: header.html vignette: > %\VignetteIndexEntry{Introduction to the CHETAH package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) knitr::opts_knit$set( root.dir = system.file('data', package = 'CHETAH') ) library(Matrix) library(CHETAH) ``` # Introduction __CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is a package for cell type identification of single-cell RNA-sequencing (scRNA-seq) data.__ A pre-print of the article describing CHETAH is available at [bioRxiv](https://www.biorxiv.org/content/10.1101/558908v1). Summary: Cell types are assigned by correlating the input data to a reference in a hierarchical manner. This creates the possibility of assignment to intermediate types if the data does not allow to fully classify to one of the types in the reference. CHETAH is built to work with scRNA-seq references, but will also work (with limited capabilities) with RNA-seq or micro-array reference datasets. So, to run CHETAH, you will only need: * your input data * a reference dataset, annotated with cell types + Both as a `SingleCellExperiment` ## At a glance To run chetah on an input count matrix `input_counts` with t-SNE^1^ coordinates in `input_tsne`, and a reference count matrix `ref_counts` with celltypes vector `ref_ct`, run: ```{r glance, echo=TRUE, eval=FALSE} ## Make SingleCellExperiments reference <- SingleCellExperiment(assays = list(counts = ref_counts), colData = DataFrame(celltypes = ref_ct)) input <- SingleCellExperiment(assays = list(counts = input_counts), reducedDims = SimpleList(TSNE = input_tsne)) ## Run CHETAH input <- CHETAHclassifier(input = input, ref_cells = reference) ## Plot the classification PlotCHETAH(input) ## Extract celltypes: celltypes <- input$celltype_CHETAH ``` __A tumor micro-environment reference dataset containing all major cell types for tumor data can be downloaded:__ [here](https://figshare.com/s/aaf026376912366f81b6). This reference can be used for all (tumor) input datasets. # Some background CHETAH constructs a classification tree by hierarchically clustering the reference data. The classification is guided by this tree. In each node of the tree, input cells are either assigned to the right, or the left branch. A confidence score is calculated for each of these assignments. When the confidence score for an assignment is lower than the threshold (default = 0.1), the classification for that cell stops in that node. This results in two types of classifications: * __final types__: Cells that are classified to one of the leaf nodes of the tree (i.e. a cell type of the reference). * __intermediate types__: Cells that had a confidence score lower than the threshold in a certain node are assigned to that intermediate node of the tree. This happens when a cell has approximately the same similarity to the cell types in right and the left branch of that node.

CHETAH generates generic names for the intermediate types: "Unassigned" for cells that are classified to the very first node, and "Node1", "Node2", etc for the additional nodes

# Installation CHETAH will be a part of Bioconductor starting at release 2.9 (30th of April), and will be available by: ```{r bioconductor_inst, echo=TRUE, eval=FALSE} ## Install BiocManager is neccesary if (!require("BiocManager")) { install.packages("BiocManager") } BiocManager::install('CHETAH') # Load the package library(CHETAH) ``` The development version can be downloaded from the development version of Bioconductor (in R v3.6). ```{r github _inst, echo=TRUE, eval=FALSE} ## Install BiocManager is neccesary if (!require("BiocManager")) { install.packages("BiocManager") } BiocManager::install('CHETAH', version = "devel") # Load the package library(CHETAH) ``` # Preparing your data ## Required data If you have your data stored as `SingleCellExperiments`, continue to [the next step](#sce). Otherwise, you need the following data before you begin: * input scRNA-seq count data of the cells to be classified + a data.frame or matrix, with cells in the columns and genes in the rows * (!) normalized scRNA-seq count data of reference cells + in similar format as the input * the cell types of the reference cells + a named character vector (names corresponding to the colnames of the reference counts) * (optional) a 2D reduced dimensional representation of your input data for visualization: e.g. t-SNE^1^, PCA. + a two-column matrix/data.frame, with the cells in the rows and the two dimensions in the columns As an example on how to prepare your data, we will use melanoma input data from [Tirosh et al.](10.1126/science.aad0501) and head-neck tumor reference data from [Puram et al.](10.1016/j.cell.2017.10.044) as an example.

For information on how to create your own reference see [Creating a Reference](#ref-prep)

```{r prepare_examp, echo=TRUE, eval=TRUE} ## To prepare the data from the package's internal data, run: celltypes_hn <- headneck_ref$celltypes counts_hn <- assay(headneck_ref) counts_melanoma <- assay(input_mel) tsne_melanoma <- reducedDim(input_mel) ## The input data: a Matrix class(counts_melanoma) counts_melanoma[1:5, 1:5] ## The reduced dimensions of the input cells: 2 column matrix tsne_melanoma[1:5, ] all.equal(rownames(tsne_melanoma), colnames(counts_melanoma)) ## The reference data: a Matrix class(counts_hn) counts_hn[1:5, 1:5] ## The cell types of the reference: a named character vector str(celltypes_hn) ## The names of the cell types correspond with the colnames of the reference counts: all.equal(names(celltypes_hn), colnames(counts_melanoma)) ``` ## `SingleCellExperiments` {#sce} CHETAH expects data to be in the format of a `SingleCellExperiment`, which is an easy way to store different kinds of data together. Comprehensive information on this data type can be found [here](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html). A `SingleCellExperiment` holds three things: * counts: `assays` - as a `list` of `Matrices` * meta-data: `colData` - as `DataFrames` * reduced dimensions (e.g. t-SNE, PCA): `ReducedDims` - as a `SimpleList` of 2-column `data.frames` or `matrices` CHETAH needs * a reference `SingleCellExperiment` with: + an assay + a colData column with the corresponding cell types (default "celltypes") * an input `SingleCellExperiment` with: + an assay + a reducedDim (e.g. t-SNE) For the example data, we would make the two objects by running: ```{r make_sce, echo=TRUE, eval=TRUE} ## For the reference we define a "counts" assay and "celltypes" metadata headneck_ref <- SingleCellExperiment(assays = list(counts = counts_hn), colData = DataFrame(celltypes = celltypes_hn)) ## For the input we define a "counts" assay and "TSNE" reduced dimensions input_mel <- SingleCellExperiment(assays = list(counts = counts_melanoma), reducedDims = SimpleList(TSNE = tsne_melanoma)) ```

Note: CHETAH functions default to the first `assay`/`reducedDim` in an object and "celltypes" for the reference's `colData`. See ?CHETAHclassifier and ?PlotCHETAH on how to change this behaviour.

# Running CHETAH Now that the data is prepared, running chetah is easy: ```{r run_chetah, echo=TRUE, eval=TRUE} input_mel <- CHETAHclassifier(input = input_mel, ref_cells = headneck_ref) ``` ## The output CHETAH returns the input object, but added: * input$celltype_CHETAH + a named character vector that can directly be used in any other workflow/method. * "hidden" `int_colData` and `int_metadata`, not meant for direct interaction, but + which can all be viewed and interacted with using: `PlotCHETAH` and `CHETAHshiny` ## Standard plots CHETAH's classification can be visualized using: `PlotCHETAH`. This function plots both the classification tree and the t-SNE (or other provided reduced dimension) map. Either the __final types__ or the __intermediate types__ are colored in these plots. The non-colored types are represented in a grayscale. To plot the __final types__: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} PlotCHETAH(input = input_mel) ```

Please note that each type "NodeX" corresponds to the node with number X and that the type "Unassigned" corresponds to the node 0

Conversely, to color the __intermediate types__: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} PlotCHETAH(input = input_mel, interm = TRUE) ``` If you would like to use the classification, and thus the colors, in another package (e.g. Seurat^2^), you can extract the colors using: ```{r eval=FALSE} colors <- PlotCHETAH(input = input_mel, return_col = TRUE) ``` ## `CHETAHshiny` The classification of CHETAH and other outputs like profile and confidence scores can be visualized in a shiny application that allows for easy and interactive analysis of the classification. Here you can view: * the confidence of all assignments * the classification in an interactive window * the genes used by CHETAH, an it's expression in the input data * a lot more The following command will open the shiny application as in an R window. The page can also be opened in your default web-browser by clicking "Open in Browser" at the very top. ```{r eval = FALSE} CHETAHshiny(input = input_mel) ``` # Changing classification ## Confidence score CHETAH calculates a confidence score for each assignment of an input cell to one of the branches of a node. The confidence score: * has a value between 0 and 2 * will normally lie between 0 and 1 * 0 represents no confidence for an assignment, 1 high confidence. __The default confidence threshold of CHETAH is 0.1.__ This means that whenever a cell is assigned to a branch and the confidence of that assignment is lower than 0.1, the classification will stop in that node. __The confidence threshold can be adjusted in order to classify more or fewer cells to a final type:__ * Using a confidence threshold of 0 will classify all input cells to a final type. Be aware that this classification can be noisy and can contain incorrect classifications. * Using a threshold of 0.2, 0.3, 0.4, etc, will classify a decreasing number of cells to a final type, with the remaining cells having a increasingly high confidence throughout all nodes in the tree. For example, to only classify cells with very high confidence: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} input_mel <- Classify(input = input_mel, 0.8) PlotCHETAH(input = input_mel, tree = FALSE) ``` Conversely, to classify all cells: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} input_mel <- Classify(input_mel, 0) PlotCHETAH(input = input_mel, tree = FALSE) ``` ## Renaming types For renaming types in the tree, CHETAH comes with the `RenameBelowNode` function. This can be interesting when you are more interested in the general types, type in the different __intermediate__ and __final types__. For the example data, let's say that we are not interested in all the different subtypes of T-cells (under Node6 and Node7), we can name all these cells "T cells" by running: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} input_mel <- RenameBelowNode(input_mel, whichnode = 6, replacement = "T cell") PlotCHETAH(input = input_mel, tree = FALSE) ``` To reset the classification to its default, just run `Classify` again: ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} input_mel <- Classify(input_mel) ## the same as Classify(input_mel, 0.1) PlotCHETAH(input = input_mel, tree = FALSE) ``` # Creating a reference {#ref-prep} ## Step 0: Obtain a reference. * Download or use one of your own scRNA-seq datasets, for which cell type labels are available. + A tumor micro-environment reference dataset containing all major cell types for tumor data can be downloaded [here](https://figshare.com/s/aaf026376912366f81b6) * Make a `SingleCellExperiment` object, as explained [above](#sce) ## Step 1: good reference characteristics CHETAH can use any scRNA-seq reference, but the used reference greatly influences the classification. The following general rules apply on choosing and creating a reference: * Better results can be achieved by using a reference and an input dataset that are from the same biological type, or at least consist of cells that are in the same cell state. E.g. for an input dataset of PBMCs, a bone marrow reference dataset could be used, but as these cells are more naive or precursor cells, this might negatively influence the classification. In this case, another PBMC dataset would work optimally. * The annotation of the reference directly influences the classification. The more accurate the cell type labels, the better the classification. Also, discard any cells that are doublets, contaminations, etc. * The sparser the reference data, the more reference cells are needed to create a reliable reference. For high coverage Smart-Seq2^3^ data, as little as 10-20 cells are needed per cell type. For sparser 10X Genomics data, 100+ cells normally gives optimal results.

To reduce computation time with very big references, first try to subsample each cell type to 100-200 cells. CHETAH should have very similar performance to using all cells. For a `SingleCellExperiment` "ref" with cell type metadata "celltypes", this could be done by: ```{r, eval = FALSE} cell_selection <- unlist(lapply(unique(ref$celltypes), function(type) { type_cells <- which(ref$celltypes == type) if (length(type_cells) > 200) { type_cells[sample(length(type_cells), 200)] } else type_cells })) ref_new <- ref[ ,cell_selection] ```

## Step 1: normalization CHETAH does not require normalized input data, but __the reference data has to be normalized beforehand__. The reference data that is used in this vignette is already normalized. __Only for sake of the example__, we use this dataset anyway to perform nomalization: ```{r, eval = FALSE} assay(headneck_ref, "counts") <- apply(assay(headneck_ref, "counts"), 2, function(column) log2((column/sum(column) * 100000) + 1)) ``` ## Step 2: discaring of house-keeping genes Certainly with reference with relatively high drop-out rates, CHETAH can be influenced by highly expressed, and thus highly variable, genes. In our experience, mainly ribosomal protein genes can cause such an effect. We therefore delete these genes, using the "ribosomal" list from [here](https://figshare.com/s/aaf026376912366f81b6) ```{r, eval = FALSE} ribo <- read.table("~/ribosomal.txt", header = FALSE, sep = '\t') headneck_ref <- headneck_ref[!rownames(headneck_ref) %in% ribo[,1], ] ``` ## Step 3: Reference QC The performance of CHETAH is heavily dependent on the quality of the reference. The quality of the reference is affected by: 1. the quality of the scRNA-seq data 2. the accuracy of the cell type labels To see how well CHETAH can distinguish between the cell types in a reference, `CorrelateReference` and more importantly `ClassifyReference` can be run. ### CorrelateReference `CorrelateReference` is a function that, for every combination of two cell types, finds the genes with the highest fold-change between the two and uses these to correlate them to each other. If the reference is good, all types will correlate poorly or even better, will anti-correlate. ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12} CorrelateReference(ref_cells = headneck_ref) ``` In this case, most cell types will be distinguishable: many types don't correlate, or anti-correlate. However, some types are quite similar. Regulatory and CD4 T cells, or CD4 and CD8 T cells, might be hard to distinguish in the input data. ### ClassifyReference Another check to see whether CHETAH can distinguish between the cell types in the reference is `ClassifyReference`. This function uses the reference to classify the reference itself. If CHETAH works well with the reference, there should be almost no mix-ups in the classification, i.e. all cells of type A should be classified to type A. ```{r out.width="100%", dpi = 100, fig.height = 6, fig.width = 12, fig.cap = "In this plot, the rows are the original cell type labels, the columns the labels that were assigned during classification. The colors and sizes of the squares indicate which part of the cells of the row name type are classified to the column type. E.g. 4th row, 2th column shows that 5-10% of CD4 T cells are classified as regulatory T cells."} ClassifyReference(ref_cells = headneck_ref) ``` In this reference, there is never more than 10% mix-up between two cell types. In addition, a low percentage of cells is classified as an intermediate type. Most mix-ups occur between subtypes of T cells. In this case the user should be aware that these cell type labels have the highest chance to interchange. # Optimizing the classification CHETAH is optimized to give good results in most analyses, but it can happen that a classification is imperfect. When CHETAH does not give the desired output (too little cells are classified, visually random classification, etc), These are the following steps to take (in this order): * Check if your reference is correctly created (see [above](#ref-prep). This is the most important step! * If too little cells are classified, lower the Confidence threshold to 0.05 or 0.01 + Beware of false positives! Always check if the result makes sense. + using `input[!(grepl("^RP", rownames(input))), ]` is an imperfect, but very quick way to do this. * Try using a different number of genes for the classification (the `n_genes` parameter). + this defaults to 200, but sometimes 100 or (in sparse data) 500 can give better results * Find another reference * Try one of the other scRNA-seq classification methods available. * Open an issue on [github](github.com/jdekanter/CHETAH). We might be able to help you further

^1^ Van Der Maaten and Hinton (2008). Visualizing high-dimensional data using t-sne. _J Mach Learn Res_. 9: 2579-2605. doi: 10.1007/s10479-011-0841-3. ^2^ Satija et al. (2015) Spatial reconstruction of single-cell gene expression data. _Nat Biotechnol_. 33(5):495-502. May 2015. doi: 10.1038/nbt.3192. More information at: https://satijalab.org/seurat/ ^3^ Picelli et al. (2013) Smart-seq2 for sensitive full-length transcriptome profiling in single cells. _Nat Methods_. 10(11): 1096-1100. doi: 10.1038/nmeth.2639. ```{r} sessionInfo() ```