--- title: "The cytoKernel user's guide" author: - name: Tusharkanti Ghosh affiliation: Colorado School of Public Health - name: Victor Lui affiliation: University of Colorado, Anschutz Medical Campus, School of Medicine - name: Pratyaydipta Rudra affiliation: Oklahoma State University - name: Souvik Seal affiliation: Colorado School of Public Health - name: Thao Vu affiliation: Colorado School of Public Health - name: Elena Hsieh affiliation: University of Colorado, Anschutz Medical Campus, School of Medicine - name: Debashis Ghosh affiliation: Colorado School of Public Health output: BiocStyle::html_document: highlight: "tango" code_folding: show toc: true toc_float: collapsed: false package: cytoKernel abstract: cytoKernel (Differential Expression Analysis using Kernel-based Score test) is a nonlinear method based approach that detects differentially expressed features for high-dimensional biological data. citation_package: natbib bibliography: cytoKernel.bib vignette: | %\VignetteIndexEntry{The CytoK user's guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE, echo=FALSE} # date: "`r doc_date()`" # "`r pkg_ver('BiocStyle')`" # ``` --- ```{r, echo=FALSE, results="hide", message=FALSE} require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) ``` ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() ``` # Introduction The majority of statistical strategies used in differential expression analysis of high-dimensional biological data (e.g., gene expression, CyTOF) are based on linear models or (analysis of variance). Linear models are favored in many biological problems mainly due to their ease of use and interpretability. However, some biological problems (e.g., gene expression, CyTOF) often require nonlinear models as linear models might be insufficient to capture the relationship among the co-expression features (e.g., relationships among signaling markers within (and across) cell subpopulations in CyTOF data). Kernel-based approaches extend the class of linear models to nonlinear models in a computationally tractable manner. Additionally, kernel-based approaches assume a more general relationship based on a Hilbert space of functions spanned by a certain kernel function instead of assuming a linear functional relationship. In this vignette, we introduce a differential expression analysis of high-dimensional biological data using kernel based score test, referred to as **kernel-based differential analysis** (**cytoKernel**), which can identify differentially expressed features. The `cytoKernel` R-package contains the `CytoK()` function, which computes the adjusted p value for each feature based on two groups. Further, the package calculates the shrunken effective size and its corresponding effective size standard deviation (sd) using the Adaptive Shrinkage (ash) procedure (@stephens2017false). We demonstrate with an example data set that the Differential expression analysis using Kernel-based Score test (cytoKernel) procedure can be adapted to flow and mass cytometry experiments along with other biological experiments (e.g., gene expression, RNASeq). Consider a matrix of order $p \times n$ with elements $x_{ij}$, $i=1,\dots,p$ and $j=1,\dots,n$, where $n$ is the number of samples (Group1+Group2 combined) and $p$ is the number of features. We observe the data $\{x_{ij},y_j\}$, where $x_{ij}$ is the median intensity for sample $j$ and feature $i$. $y_j$ (binary response) is the group (or condition) label for sample $j$ ($y_j=0$ (Group1) and $y_j=1$ (Group2). For feature $i$, $i=1,\dots,p$, we define a simple logistic (semi-parametric) model, $$ logit[P(y_j=1)] = \beta_0+ f(x_{ij}), $$ where $f(.)$ is a centered smooth function in a Reproducible Kernel Hilbert Space (RKHS) spanned by $k$. If $H_0:~f(.)=0$, then feature expression value $x_{ij}$ is not related to the group labels $y_j$ for feature $i$ i.e., feature $i$ is differentially expressed. Let $K$ be a $n \times n$ Gram matrix with $K_{st}=K_{\rho}(x_{sj},~x_{tj})$. Here, $k_{\rho}(.,.)$ is the reproducing kernel of RKHS which contains $f(.)$ and $\rho$ is an unknown kernel parameter. Let $\mathbf{y}$ be a $n \times 1$ vector of $0$ and $1$. The score test statistic under null hypothesis ($H_0:~f(.)=0$) in the logistic model defined above is, $$ S(\rho) = \frac{Q(\rho)- \mu_{Q}}{\sigma_{Q}}, $$ where $Q(\rho)=(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}\mathbf{K}(\mathbf{y}-\mathbf{\hat{\mu_0}})$, $\mathbf{\hat{\mu_0}}={logit}^{-1}\hat{\beta_0}$ and $\hat{\beta_0}$ is the estimate of $\beta_0$ under null model. More details about the estimation of $\mu_{Q}$, $\sigma_{Q}$ and choices of $\rho$ in (@zhan2015kernel, @liu2008estimation, @davies1987hypothesis. $Q(\rho)$ can be rewritten as, $$ Q(\rho) = \sum_s {\sum_t {{k(x_{is},x_{it})}(y_{s}-\hat{\mu_0})(y_{t}-\hat{\mu_0})}}, $$ which is the component-wise product of matrices $\mathbf{K}$ and $(\mathbf{y}-\mathbf{\hat{\mu_0}})(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}$. We use a Gaussian distance based kernel: $$ k(x_{is},x_{it})= exp\left\{-\frac{(x_{is}-x_{it})^2}{\rho}\right\}. $$ $S(\rho)$ has a Normal distribution for each value of $\rho$ (@davies1987hypothesis). The data structure is shown in Figure 1 below. ```{r, echo=FALSE, fig.cap="Feature-Sample data matrix", out.width = '100%'} knitr::include_graphics("cytoSchematic.png") ``` # Getting Started Load the package in R ```{r load-lib, message=FALSE} library(cytoKernel) ``` # cytoHD Data The **cytoKernel** package contains a pre-processed median marker expressions data `SummarizedExperiment` assay object of 126 cluster-marker combinations (features) measured in 8 subjects (4 measured before and 4 upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30’, resulting in a total of 8 samples) from (@bodenmiller2012multiplexed) that was also used in the CITRUS paper (@bruggner2014automated) and `CATALYST` (@crowell2020r). In this vignette, we only used a subset of the original raw cytometry data downloaded from the Bioconductor data package `HDCytoData` (@weber2019hdcytodata) using the command (Bodenmiller_BCR_XL_flowSet()). ## cytoHDBMW data pre-processing The **cytoHDBMW** data in the **cytoKernel** package was pre-processed using the CATALYST Bioconductor package (@crowell2020r). The data pre-processing include $4$ steps and they are as follows: 1. `Creating a SingleCellExperiment Object`: the flowSet data object along with the metadata are converted into a SingleCellExperiment object using the `CATALYST` R/Bioconductor package. 2. `Clustering`: We apply `Louvain` algorithm using the R package `igraph` (@csardi2006igraph) to cluster the expression values by the state markers (surface markers) (@traag2019louvain). 3. `Median`: Medians are calculated within a cluster for every signaling marker and subject using the `scuttle` Bioconductor package (@mccarthy2017scater). 4. `Aggregating and converting the data`: We convert the aggregated data into a SummarizedExperiment. ```{r data-1, message=FALSE, warning=FALSE} data("cytoHDBMW") cytoHDBMW ``` # Using the `CytoK()` function ## Input for `CytoK()` The `CytoK()` function must have two object as input: 1. `object`: a data frame or a matrix or a SummarizedExperiment object with abundance measurements of metabolites (features) on the rows and samples (samples) as the columns. `CytoK()` accepts objects which are a data frame or matrix with observations (e.g. cluster-marker combinations) on the rows and samples as the columns. 2. `group_factor`: a binary categorical response variable that represents the group condition for each sample. For example if the samples represent two different groups or conditions (e.g., before stimulation and after stimulation), provide `CytoK()` with a phenotype representing which columns in the `object` are different groups. 3. `lowerRho`: **optional** a positive value that represents the lower bound of the kernel parameter. Default is 2. 4. `upperRho`: **optional** a positive value that represents the upper bound of the kernel parameter. Default is 12. 5. `gridRho`: **optional** a positive value that represents the number of grid points in the interval of upper and bound of the kernel parameter. Default is 4. 6. `alpha`: **optional** level of significance to control the False Discovery rate (FDR). Default is $0.05$ (i.e., $\alpha=0.05$). 7. `featureVars`: **optional** Vector of the columns which identify features. If a `SummarizedExperiment` is used for `data`, row variables will be used. Default is NULL. ## Running `CytoK()` ### cytoHDBMW SummarizedExperiment example - Identifying differentially expressed features We apply the CytoK procedure to identify the differentially expressed cluster-marker combinations in the cytoHDBMW data. To run the `CytoK()` function, we only input the data object and group factor. We obtain 3 outputs after running the `CytoK()` function. They are shown below: ```{r CytoK_output} library(cytoKernel) CytoK_output<- CytoK(cytoHDBMW,group_factor = rep(c(0, 1), c(4, 4)),lowerRho=2,upperRho=12,gridRho=4, alpha = 0.05,featureVars = NULL) CytoK_output ## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values head(CytoKFeatures(CytoK_output)) ## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values ordered by ##adjusted p values from low to high head(CytoKFeaturesOrdered(CytoK_output)) ## Percent of differentially expressed features CytoKDEfeatures(CytoK_output) ``` ## Filtering the data by differentially expressed features ```{r byFeatures} ## Filtering the data by reproducible features CytoKDEData_HD<- CytoKDEData(CytoK_output, by = "features") CytoKDEData_HD ``` ## Heatmap of the expression matrix This heatmap illustrates the expression profiles with differentially expressed features (cluster-marker combinations) on the rows and patients on the columns from the kernel-based score test implemented using the `CytoK` function. The differentially expressed data (matrix) can be extracted using the (`CytoKDEData`) function (see above). The generic heatmap of the differentially expressed expression matrix can be plotted using the `plotCytoK()` function. Any specific meta information can also be added using the `ComplexHeatmap` package. An illustration is shown below where the cluster ids are separately added onto the heatmap generated by the `plotCytoK()` function. ```{r fig.cap="Differentially expressed (top 10) cluster-marker data using cytoKernel", plot-CytoK-Heatmap} heatmap1<- plotCytoK(CytoK_output, group_factor = rep(c(0, 1), c(4, 4)),topK=10, featureVars = NULL) featureOrderedExtracted<- CytoKFeaturesOrdered(CytoK_output) rowmeta_cluster<- featureOrderedExtracted$cluster topK<- 10 rowmeta_clusterTopK<- rowmeta_cluster[seq_len(topK)] library(ComplexHeatmap) heatmap2<- Heatmap(rowmeta_clusterTopK, name = "cluster") heatmap2+heatmap1 ``` # Session Info ```{r session-info} sessionInfo() ``` # References