%\VignetteEngine{knitr::knitr} \documentclass[a4paper, 9pt]{article} <>= BiocStyle::latex() @ %\VignetteIndexEntry{Single-cell Interpretation via Multi-kernel LeaRning (\Biocpkg{SIMLR})} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{placeins} \usepackage{url} \usepackage{tcolorbox} \usepackage{authblk} \begin{document} \title{Single-cell Interpretation via Multi-kernel LeaRning (\Biocpkg{SIMLR})} \author[1]{Daniele Ramazzotti} \author[1]{Bo Wang} \author[2]{Luca De Sano} \author[1]{Serafim Batzoglou} \affil[1]{Department of Computer Science, Stanford University, Stanford, CA , USA.} \affil[2]{Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy.} \date{\today} \maketitle \begin{tcolorbox}{\bf Overview.} Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. \\ \vspace{1.0cm} {\em In this vignette, we give an overview of the package by presenting some of its main functions.} \vspace{1.0cm} \renewcommand{\arraystretch}{1.5} \end{tcolorbox} <>= library(knitr) opts_chunk$set( concordance = TRUE, background = "#f3f3ff" ) @ \newpage \tableofcontents \section{Changelog} \begin{itemize} \item[1.0.0] implements SIMLR and SIMLR feature ranking algorithms. \item[1.0.2] implements SIMLR large scale algorithms. \item[1.5.1] implements CIMLR clustering method. \item[1.7.3] removes CIMLR clustering method; please refer to https://github.com/danro9685/CIMLR. \end{itemize} \section{Algorithms and useful links} \label{sec:stuff} \renewcommand{\arraystretch}{2} \begin{center} \begin{tabular}{| l | p{6.0cm} | l |} {\bf Acronym} & {\bf Extended name} & {\bf Reference}\\ \hline SIMLR & Single-cell Interpretation via Multi-kernel LeaRning & \href{https://www.nature.com/articles/nmeth.4207}{Publication}\\ \hline Package & A Tool for Large-Scale Genomic Analyses & \href{https://onlinelibrary.wiley.com/doi/full/10.1002/pmic.201700232}{Publication}\\ \hline \end{tabular} \end{center} \section{Using the SIMLR R package} We first load the data provided as examples in the package. The dataset BuettnerFlorian is used for an example of the standard SIMLR, while the dataset ZeiselAmit is used for an example of SIMLR large scale. <>= library(SIMLR) data(BuettnerFlorian) data(ZeiselAmit) @ The external R package igraph is required for the computation of the normalized mutual information to assess the results of the clustering. <>= library(igraph) @ We now run SIMLR as an example on an input dataset from Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160. For this dataset we have a ground true of 3 cell populations, i.e., clusters. <>= set.seed(11111) example = SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0) @ We now compute the normalized mutual information between the inferred clusters by SIMLR and the true ones. This measure with values in [0,1], allows us to assess the performance of the clustering with higher values reflecting better performance. <>= nmi_1 = compare(BuettnerFlorian$true_labs[,1], example$y$cluster, method="nmi") nmi_1 @ As a further understanding of the results, we now visualize the cell populations in a plot. <>= plot(example$ydata, col = c(topo.colors(BuettnerFlorian$n_clust))[BuettnerFlorian$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for BuettnerFlorian") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image-1} \end{center} \caption{Visualization of the 3 cell populations retrieved by SIMLR on the dataset by Florian, et al.} \end{figure*} We also run SIMLR feature ranking on the same inputs to get a rank of the key genes with the related pvalues. <>= set.seed(11111) ranks = SIMLR_Feature_Ranking(A=BuettnerFlorian$results$S,X=BuettnerFlorian$in_X) @ <>= head(ranks$pval) head(ranks$aggR) @ Similarly, we show an example for SIMLR large scale on an input dataset being a reduced version of the dataset provided in Buettner, Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142. For this dataset we have a ground true of 9 cell populations, i.e., clusters. But we should notice that here we use for computational reasons a reduced version of the data, so please refer to the original publication for the full data. <>= set.seed(11111) example_large_scale = SIMLR_Large_Scale(X = ZeiselAmit$in_X, c = ZeiselAmit$n_clust, kk = 10) @ We compute the normalized mutual information between the inferred clusters by SIMLR large scale and the true ones. <>= nmi_2 = compare(ZeiselAmit$true_labs[,1], example_large_scale$y$cluster, method="nmi") nmi_2 @ As a further understanding of the results, also in this case we visualize the cell populations in a plot. <>= plot(example_large_scale$ydata, col = c(topo.colors(ZeiselAmit$n_clust))[ZeiselAmit$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for ZeiselAmit") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image_large_scale-1} \end{center} \caption{Visualization of the 9 cell populations retrieved by SIMLR large scale on the dataset by Zeisel, Amit, et al.} \end{figure*} Now, as a final example, we also provide the results of two heuristics (see the original SIMLR paper) to estimate the number of clusters from data. <>= set.seed(53900) NUMC = 2:5 res_example = SIMLR_Estimate_Number_of_Clusters(BuettnerFlorian$in_X, NUMC = NUMC, cores.ratio = 0) @ Best number of clusters, K1 heuristic: <>= NUMC[which.min(res_example$K1)] @ K2 heuristic: <>= NUMC[which.min(res_example$K2)] @ Results of the two heuristics: <>= res_example @ \section{\Rcode{sessionInfo()}} <>= toLatex(sessionInfo()) @ \end{document}