--- title: "***POWSC: power and sample size snalysis for single-cell RNA-seq***" shorttitle: "POWSC user guide" author: "Kenong Su @Emory, Zhijin Wu @Brown, Hao Wu @Emory" package: POWSC abstract: > This vignette introduces the usage of the Bioconductor package POWSC (POWer analysis for SCrna-seq experiment), which is specifically designed for power estimation and sample size estimation for scRNA-seq data. POWSC is a simulation-based method, to provide power evaluation and sample size recommen- dation for single-cell RNA-sequencing DE analysis. POWSC consists of a data simulator that creates realistic expres- sion data, and a power assessor that provides a comprehensive evaluation and visualization of the power and sam- ple size relationship. The data simulator in POWSC outperforms two other state-of-art simulators in capturing key characteristics of real datasets. The power assessor in POWSC provides a variety of power evaluations including stratified and marginal power analyses for DEs characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoffs between sample size and sequencing depth with the same total reads. vignette: > %\VignetteIndexEntry{The POWSC User's Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} output: BiocStyle::html_document: toc: true toc_float: collapsed: true smooth_scroll: true fig_width: 5 --- \tableofcontents \vspace{.1in} ## Installation and quick start ### Installation To install this package, start R (version > "4.1") and enter: ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("POWSC") ``` ### Quick start on using POWSC We use one scRNA-seq template data about ES-MEF cell types (GSE29087). In this case, we adopt one cell type (*fibro*) for the two-group comparison corresponding to the 1st scenario. ```{r setup, eval=TRUE, message=FALSE, results='hide'} suppressMessages(library(POWSC)) data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") ``` ## Background Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq under the context of differential expression, this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq including sparsity and heterogeneity gain the challenge. ## Introduction `POWSC` is an R package designed for power assessment and sample size estimation in scRNA-seq experiment. It contains three main functionalities: (1). **Parameter Estimation**: adopted and modified from the core of [SC2P](https://github.com/haowulab/SC2P). (2). **Data Simulation**: consider two cases: paired-wise comparison and multiple comparisons. (3). **Power Evaluation**: provide both stratified (detailed) and marginal powers. For the model assumption, we adopt the mixture of zero-inflated poisson (ZIP) and log-normal poisson (LNP). The power assessor in POWSC provides a variety of power evaluations including stratified and marginal power analyses for DEs characterized by **two forms** (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoffs between sample size and sequencing depth with the same total reads. ## Use POWSC In the context of differential expression (DE) analysis, scientists are usually interested in two different scenarios: (1) within cell type: comparing the same cell types across biological conditions such as case vs. control, which reveals the expression change of a particular cell type under different contexts. (2) between cell types: comparing different cell types under the same condition, which identifies biomarkers to distinguish cell types. In either case, the experiment starts from a number of cells randomly picked from a tissue sample consisting of a mixture of different cell types. The only factor one can control is the total number of cells. ### For two-group comparison In the first scenario, the numbers of cells for a particular cell type in different biological conditions are often similar, barring significant changes in cell composition. It uses one cell type as the benchmark data and perturbs the genes with mixture proportion ($\pi$) and log fold changes ($lfc$) as the DE genes according two forms. ```{r two-group, eval = TRUE, message=FALSE, results='hide'} # Users can customize how many cells they need as to change the number of n. simData = Simulate2SCE(n=100, estParas1 = est_Paras, estParas2 = est_Paras) de = runDE(simData$sce, DE_Method = "MAST") estPower1 = Power_Disc(de, simData = simData) estPower2 = Power_Cont(de, simData = simData) ``` ### For multiple-group comparisons In the second scenario, the numbers of cells for distinct cell types can be very different, so the power for DE highly depends on the mixing proportions. Here, we use 1000*57 expression matrix sampled from the patient ID AB_S7 from the GSE67835 data. The original data GSE67835 can be download here: https://www.dropbox.com/s/55zdktqfqiwfs3l/GSE67835.RData?dl=0. In this case, we utilize the most abundant three cell types including oligodendrocytes, hybrid, and neurons. The underlying cell type proportion is 20%, 30%, 50% respectively. We simulate a series of datasets corresponding to different sample sizes. For each dataset, we then perform pairwise comparison and report the power evaluation for each comparison ```{r multiple-group, eval = TRUE, message=FALSE, results='hide'} data("GSE67835_AB_S7_sce") sim_size = 200 # use a large sample is preferrable cell_per = c(0.2, 0.3, 0.5) col = colData(sce) exprs = assays(sce)$counts (tb = table(colData(sce)$Patients, colData(sce)$cellTypes)) # use AB_S7 patient as example and take three cell types: astrocytes hybrid and neurons estParas_set = NULL celltypes = c("oligodendrocytes", "hybrid", "neurons") for (cp in celltypes){ print(cp) ix = intersect(grep(cp, col$cellTypes), grep("AB_S7", col$Patients)) tmp_mat = exprs[, ix] tmp_paras = Est2Phase(tmp_mat) estParas_set[[cp]] = tmp_paras } ######### ######### Simulation part ######### sim = SimulateMultiSCEs(n = sim_size, estParas_set = estParas_set, multiProb = cell_per) ######### ######### DE analysis part ######### DE_rslt = NULL for (comp in names(sim)){ tmp = runDE(sim[[comp]]$sce, DE_Method = "MAST") DE_rslt[[comp]] = tmp } ######### ######### Summarize the power result ######### pow_rslt = pow1 = pow2 = pow1_marg = pow2_marg = NULL TD = CD = NULL for (comp in names(sim)){ tmp1 = Power_Disc(DE_rslt[[comp]], sim[[comp]]) tmp2 = Power_Cont(DE_rslt[[comp]], sim[[comp]]) TD = c(TD, tmp2$TD); CD = c(CD, tmp2$CD) pow1_marg = c(pow1_marg, tmp1$power.marginal) pow2_marg = c(pow2_marg, tmp2$power.marginal) pow_rslt[[comp]] = list(pow1 = tmp1, pow2 = tmp2) pow1 = rbind(pow1, tmp1$power) pow2 = rbind(pow2, tmp2$power) } ######### ######### Demonstrate the result by heatmap ######### library(RColorBrewer); library(pheatmap) breaksList = seq(0, 1, by = 0.01) colors = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)) dimnames(pow1) = list(names(sim), names(tmp1$CD)) dimnames(pow2) = list(names(sim), names(tmp2$CD)) pheatmap(pow2, display_numbers = TRUE, color=colors, show_rownames = TRUE, cellwidth = 30, cellheight = 40, legend = TRUE, border_color = "grey96", na_col = "grey", cluster_row = FALSE, cluster_cols = FALSE, breaks = seq(0, 1, 0.01), main = "") ``` ## Session Info ```{r} sessionInfo() ```