%\VignetteEngine{knitr::knitr} % \VignetteIndexEntry{RnaSeqSampleSize: Sample size estimation by real data} % \VignettePackage{RnaSeqSampleSize} \documentclass[12pt]{article} \textwidth 6.75in \textheight 9.5in \topmargin -.875in \oddsidemargin -.06in \evensidemargin -.06in <>= BiocStyle::latex() @ \usepackage{hyperref} \hypersetup{ colorlinks=true, %set true if you want colored links linktoc=all, %set to all if you want both sections and subsections linked linkcolor=blue, %choose some color if you want links to stand out } \begin{document} \title{RnaSeqSampleSize: Sample size estimation based on real RNA-seq data} \bioctitle[RnaSeqSampleSize: Sample size estimation based on real RNA-seq data]{RnaSeqSampleSize: Sample size estimation based on real RNA-seq data} \author{Shilin Zhao\footnote{zhaoshilin@gmail.com}} \maketitle \begin{abstract} In this vignette, we demonstrated the application of \Biocpkg{RnaSeqSampleSize} as an sample size estimation tool for RNA-seq data. A user friendly web interface is also provided at \url{http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/} for researchers not familar with R. \\ \\ \Biocpkg{RnaSeqSampleSize} package provides the following features: \begin{itemize} \item Estimation of sample size or power by single read count and dispersion; \item Estimation of sample size or power by prior real data; \item Visualization of sample size and power by power curves; \item Optimization by power or sample size matrix; \end{itemize} \end{abstract} \newpage \tableofcontents \newpage \section{Introduction} Sample size estimation is the most important issue in the design of RNA sequencing experiments. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-seq experiments. The false discovery rate for statistic tests should be controlled. At the same time, the thousands of genes have widely distributed read counts and dispersions, which were often estimated by experience or set at the most conservative values in previous sample size estimation methods. As a result, the estimated sample size will be inaccurate or over-estimated. \\ \\ To solve these issues, we developed a sample size estimation method based on the distributions of gene read counts and dispersions from real data. Datasets from the user's preliminary experiments or the Cancer Genome Atlas (TCGA) can be used as reference. The read counts and their related dispersions will be selected randomly from the reference based on their distributions, and from that, the power and sample size will be estimated and summarized. \section{User friendly web interface} A user friendly web interface for \Biocpkg{RnaSeqSampleSize} package is provided at \url{http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/}. Most of the functions in Examples section can be performed in this website. \section{Examples} First we will load the \Biocpkg{RnaSeqSampleSize} package. <>= library(RnaSeqSampleSize) @ \subsection{Estimation of sample size or power by single read count and dispersion} \subsubsection{Power estimation} For example, if we are estimating the power of finding significant genes for RNA-seq data with specified sample size, and we have the following parameters: \begin{itemize} \item Number of samples in each group: 63; \item Minimal fold change between two groups: 2; \item Minimal average read counts: 5; \item Maximal dispersion: 0.5; \item False discovery rate: 0.01; \end{itemize} As a result, the estimated power is 0.8 by \Rfunction{est\_power} function. It means that we have 80\% probability to find the significant genes with 63 samples in each group. <>= example(est_power) @ \subsubsection{Sample size estimation} For example, if we are estimating the sample size for RNA-seq data to achieve desired power of finding significant genes, and we have the following parameters: \begin{itemize} \item Desired power of finding significant genes: 0.8; \item Minimal fold change between two groups: 2; \item Minimal average read counts: 5; \item Maximal dispersion: 0.5; \item False discovery rate: 0.01; \end{itemize} As a result, the estimated sample size is 63 by \Rfunction{sample\_size} function. It means that if we want to have 80\% probability to find the significant genes, we need 63 samples in each group. <>= example(sample_size) @ \subsection{Estimation of sample size or power by reference data} \subsubsection{Power estimation with datasets in RnaSeqSampleSizeData package} \Biocexptpkg{RnaSeqSampleSizeData} package contains the read counts and dispersion distribution from some real datasets and can be used as prior data for sample size or power estimation. They can be called with following names: <>= data(package="RnaSeqSampleSizeData")$results[,"Item"] @ For example, if we are estimating the power of finding significant genes for RNA-seq data with specified sample size, and we have the following parameters: \begin{itemize} \item Number of samples in each group: 65; \item Minimal fold change between two groups: 2; \item Prior data: TCGA READ data, stored in \Biocexptpkg{RnaSeqSampleSizeData} package, can be used with name TCGA\_READ; \item False discovery rate: 0.01; \end{itemize} Here we demonstrated the power estimation by prior data in three different situations. \begin{itemize} \item If we are intesested in all genes, we can use repNumber parameter to specify random number of genes to perform power estimation; <>= est_power_distribution(n=65,f=0.01,rho=2, distributionObject="TCGA_READ",repNumber=5) @ Please note here the parameter repNumber was very small (5) to make the example code faster. We suggest repNumber should be at least set as 100 in real analysis. \item If we are only intesested in a list of genes, we can use selectedGenes parameter to specify the list of genes to perform power estimation; <>= #Power estimation based on some interested genes. #We use storeProcess=TRUE to return the details for all selected genes. selectedGenes<-names(TCGA_READ$pseudo.counts.mean)[c(1,3,5,7,9,12:30)] powerDistribution<-est_power_distribution(n=65,f=0.01,rho=2, distributionObject="TCGA_READ", selectedGenes=selectedGenes, storeProcess=TRUE) str(powerDistribution) mean(powerDistribution$power) @ \item If we are only intesested a specified pathway, we can use pathway and species parameters to specify the genes in a pathway to perform power estimation. <>= powerDistribution<-est_power_distribution(n=65,f=0.01,rho=2, distributionObject="TCGA_READ",pathway="00010", minAveCount=1,storeProcess=TRUE) mean(powerDistribution$power) @ \end{itemize} As a result, we use \Rfunction{est\_power\_distribution} function and find the estimated power is 0.91 for random genes, 0.81 for specified gene list, and 0.77 for genes in Glycolysis and Gluconeogenesis (pathway 00010) pathway. \subsubsection{Sample size estimation with datasets in RnaSeqSampleSizeData package} For example, if we are estimating the sample size for RNA-seq data to achieve desired power of finding significant genes, and we have the following parameters: \begin{itemize} \item Desired power of finding significant genes: 0.8; \item Minimal fold change between two groups: 2; \item Prior data: TCGA READ data, stored in \Biocexptpkg{RnaSeqSampleSizeData} package, can be used with name TCGA\_READ; \item False discovery rate (FDR): 0.01; \end{itemize} As a result, we use \emph{sample\_size\_distribution} function and find the estimated sample size is 41 for random genes. <>= sample_size_distribution(power=0.8,f=0.01,distributionObject="TCGA_READ", repNumber=5,showMessage=TRUE) @ Please note here the parameter repNumber was very small (5) to make the example code faster. We suggest repNumber should be at least set as 100 in real analysis. \subsubsection{Sample size or power estimation with user's prior dataset} For example, if the user has a RNA-seq data with 10000 genes and 10 samples as prior dataset: <>= #Generate a 10000*10 RNA-seq data as prior dataset set.seed(123) dataMatrix<-matrix(sample(0:3000,100000,replace=TRUE),nrow=10000,ncol=10) colnames(dataMatrix)<-c(paste0("Control",1:5),paste0("Treatment",1:5)) row.names(dataMatrix)<-paste0("gene",1:10000) head(dataMatrix) @ Then we are estimating the power of finding significant genes for RNA-seq data with specified sample size, and we have the following parameters: \begin{itemize} \item Number of samples in each group: 65; \item Minimal fold change between two groups: 2; \item Prior data: User's prior dataset with 10000 genes and 10 samples; \item False discovery rate: 0.01; \end{itemize} We will use \Rfunction{est\_count\_dispersion} to estitamete the gene read count and dispersion distribution of user's prior dataset. And then \Rfunction{est\_power\_distribution} function will be used to estimate power. <>= #Estitamete the gene read count and dispersion distribution dataMatrixDistribution<-est_count_dispersion(dataMatrix, group=c(rep(0,5),rep(1,5))) #Power estimation by read count and dispersion distribution est_power_distribution(n=65,f=0.01,rho=2, distributionObject=dataMatrixDistribution,repNumber=5) @ As a result, we can find the estimated power is 0.91. Please note here the parameter repNumber was very small (5) to make the example code faster. We suggest repNumber should be at least set as 100 in real analysis. \subsection{Power curve visualization} For example, if we are going to compare the power of finding significant genes for different false discovery rate, and we have the following parameters: \begin{itemize} \item Number of samples in each group: 63; \item Minimal fold change between two groups: 2; \item Minimal average read counts: 5; \item Maximal dispersion: 0.5; \item False discovery rate: 0.01 and 0.05; \end{itemize} <>= result1<-est_power_curve(n=63, f=0.01, rho=2, lambda0=5, phi0=0.5) result2<-est_power_curve(n=63, f=0.05, rho=2, lambda0=5, phi0=0.5) plot_power_curve(list(result1,result2)) @ As a result, the relation between power and sample size can be estimated by \Rfunction{est\_power\_curve} function and the power curves can be generated by \Rfunction{ plot\_power\_curve} function. \subsection{Optimization by power or sample size matrix} For example, if the budget is limited, we need to balance the number of replications and sequence depth. We can use the \Rfunction{optimize\_parameter} function to find the relation between sample size, read counts, and estimated power. And then the optimized parameters can be determined. <>= result<-optimize_parameter(fun=est_power,opt1="n", opt2="lambda0",opt1Value=c(3,5,10,15,20), opt2Value=c(1:5,10,20)) @ As a result, the estimated power distribution indicates that the number of replications plays a more significant role in determining the power than the number of read counts. \end{document}