--- title: "SCBN Tutorial" author: "Yan Zhou" package: SCBN date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{SCBN Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction This package provides a statistical normalization method and differential expression analysis for RNA-seq data between different species. It consider the different gene lengths and unmapped genes between species, and formulate the problem from the perspective of hypothesis testing and search for an optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. The methods on this package are described in the article *A statistical normalization method and differential expression analysis for RNA-seq data between different species* by Yan Zhou, Jiadi Zhu, Tiejun Tong, Junhui Wang, Bingqing Lin, Jun Zhang (2018, pending publication). The scale based normalization (SCBN) method is developed for analyzing cross species RNE-seq data, which is different from the data in same species. We have implemented the SCBN method via a set of R functions. We make a R package named SCBN, and give a tutorial for the package. The method consist three steps. Step 1: Data Pre-processing; Step 2: Calculating scaling factor for data; Step 3: Calculating p-values for each orthologous genes and select significants. We employ a simulation dataset to illustrate the usage of the SCBN package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 3.5.0. # Preparations To install the SCBN package into your R environment, start R and enter: ```{r, eval=FALSE} install.packages("BiocManager") BiocManager::install("SCBN") ``` Then, the SCBN package is ready to load. ```{r} library(SCBN) ``` # Data format In order to reproduce the presented SCBN workflow, the package includes the example data sets, which is generated by function *generateDataset()*. The next we will give an example for how to generate simulation dataset: ```{r} data(orthgenes) orthgenes[, 6:9] <- round(orthgenes[, 6:9]) orthgenes1 <- orthgenes[!(is.na(orthgenes[,6])|is.na(orthgenes[,7])| is.na(orthgenes[,8])|is.na(orthgenes[,9])), ] sim_data <- generateDataset(commonTags=5000, uniqueTags=c(100, 300), unmapped=c(400, 200),group=c(1, 2), libLimits=c(.9, 1.1)*1e6, empiricalDist=orthgenes1[, 6], genelength=orthgenes1[, 2], randomRate=1/100, pDifferential=.05, pUp=.5, foldDifference=2) ``` There are two dataset in the data subdirectory of SCBN package, that is **orthgenes.RData** and **sim_data**. **orthgenes.RData** is a real dataset come from *The evolution of gene expression levels in mammalian organs* by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. **sim_data** is a simulation dataset, which is generated by *generateDataset()* according to the previous steps. ```{r} data(orthgenes) head(orthgenes) ``` **orthgenes.RData** includes 9 columns, * the first and third columns are the ID for two species; * the second and fourth columns are the gene length for two species; * from sixth to ninth columns are RNA-seq reads for two species. ```{r} data(sim_data) head(sim_data) ``` **sim_data** includes 4 columns, * the first and third columns are gene length for two species; * the second and fourth columns are reads for two species. # Calculate scaling factor for data For different species, we need to consider not only the total read counts but also the different gene numbers and gene lengths. Therefore, we propose SCBN method, by utilizing the available knowledge of housekeeping genes, it get the optimal scaling factor by minimizing the deviation between the empirical and the nominal type I error. ```{r} data(sim_data) factor <- SCBN(orth_gene=sim_data, hkind=1:1000, a=0.05) factor ``` The output for *SCBN()* is "*median_val*" and "*scbn_val*". *median_val* is a scaling factor which is calculated by method in *The evolution of gene expression levels in mammalian organs* by Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, et al. Nature. 2011;478:343-348. *scbn_val* is a scaling factor which in our paper. We assess the performance of two method, and find the proposed method outperforms the other method. # Calculate p-values for each orthologous genes and select significants Getting the scaling factor, we calculate p-values with adjusted sage.test function. The step is as follows: ```{r} data(sim_data) orth_gene <- sim_data hkind <- 1:1000 scale <- factor$scbn_val x <- orth_gene[, 2] y <- orth_gene[, 4] lengthx <- orth_gene[, 1] lengthy <- orth_gene[, 3] n1 <- sum(x) n2 <- sum(y) p_value <- sageTestNew(x, y, lengthx, lengthy, n1, n2, scale) head(p_value) ``` Getting p-values for each orthologous genes, we choose differentially expressed orthologous genes with p-values. We assume the genes to be differentially expressed genes when p-values less than pre-setting cutoffs, such as $10^{-3}$ or $10^{-5}$.