--- title: "OVESEG User Manual" author: "Lulu Chen" date: "`r Sys.Date()`" bibliography: oveseg.bib output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{OVESEG User Manual} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) [@oveseg]. To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex. `OVESEG` package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types. # Quick Start The function `OVESEGtest()` includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use `limma::voom()` to incorporate the mean-variance relationship. Parallel option is set according to `r Biocpkg("BiocParallel")` package. ```{r, eval=FALSE} #Microarray rtest <- OVESEGtest(y, group, NumPerm=999, BPPARAM=BiocParallel::SnowParam()) #RNAseq yvoom <- limma::voom(count, model.matrix(~0+factor(group))) rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam()) ``` # OVESEG-test ## Datatypes and Input Format Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, `limma::vooma()/voom()` need to be performed firstly and the resulted weight matrix is used as the input of `OVESEG`. Requirements for the input expression data: - be transformed (log2/logCPM/logit2). - be already processed by normalization and batch effect removal. - no missing values; the molecules containing missing values should be removed prior to OVESEG-test. - no low-expressed molecules; otherwise, the results will be affected. The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix. ## Microarray data We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from [GSE28490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28490)) as an example to show OVESEG-test on microarray data. ```{r, echo=TRUE} library(OVESEG) #Import data data(RocheBT) y <- RocheBT$y #5000*15 matrix group <- RocheBT$group #15 labels #filter low-expressed probes groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x])) groupMeanMax <- apply(groupMean, 1, max) keep <- (groupMeanMax > log2(64)) y <- y[keep,] #OVESEG-test rtest1 <- OVESEGtest(y, group, NumPerm = 999, BPPARAM=BiocParallel::SnowParam()) ``` Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply `limma::vooma()` firstly. ```{r, echo=TRUE} #OVESEG-test yvooma <- limma::vooma(y, model.matrix(~0+factor(group))) rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam()) ``` ## RNAseq count data We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from [GSE60424](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60424)) as an example to show OVESEG-test on RNAseq count data. ```{r, echo=TRUE} library(OVESEG) #Import data data(countBT) count <- countBT$count #10000*60 matrix group <- countBT$group #60 labels #filter low-expressed probes groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x])) groupMeanMax <- apply(groupMean, 1, max) keep <- (groupMeanMax > 2) count <- count[keep,] #OVESEG-test lib.size <- max(colSums(count)) yvoom <- limma::voom(count, model.matrix(~0+factor(group)), lib.size = lib.size) rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam()) ``` Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. `lib.size` and other parameters in `limma::voom()` can be set manually according to `r Biocpkg("limma")` package. ## p-values `OVESEGtest` returns three p-values: `pv.overall`, `pv.oneside` and `pv.multiside`. `pv.overall` is calculated by all permutations regardless of upregulated subtypes. `pv.oneside` is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. `pv.multiside` is `pv.oneside` multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper [@oveseg]. # Useful intermediate results ## Test statistics OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left\{min_{l \neq k}\left\{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right\}\right\}}$$ where $\mathbf{\mu_k(j)}$ is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, `OVESEGtstat()` is useful if users only need test statistics for ranking genes: ```{r, echo=TRUE} #OVESEG-test statistics rtstat1 <- OVESEGtstat(y, RocheBT$group) rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights) rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights) ``` ## Posterior probabilities of null hypothesis components Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by `OVESEGtest()`. If users only want to observe probewise posterior probabilities, `postProbNull()` is helpful. ```{r, echo=TRUE} ppnull1 <- postProbNull(y, RocheBT$group) ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights) ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights) ``` By accumulating and normalizing probewise posterior probability with the function `nullDistri()`, we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs. ```{r, echo=TRUE} nullDistri(ppnull1) nullDistri(ppnull2) nullDistri(ppnull3) ``` There are totally $\mathbf{2^K-1}$ types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function `patternDistri()` can present us the ratios of all possible upregulation patterns among subtypes: ```{r, echo=TRUE} patterns <- patternDistri(ppnull1) #or patterns <- patternDistri(rtest1) ``` The ratios of patterns can be illustrated as following. ```{r, fig.height=6, fig.width=6, fig.show='hold'} library(gridExtra) library(grid) library(reshape2) library(ggplot2) gridpatterns <- function (ppnull) { K <- length(ppnull$label) cellcomp <- patternDistri(ppnull) cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),] #Bar Chart to show pattern percentage DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp) p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + theme_bw(base_size = 16) + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), plot.margin=unit(c(1,1,-0.2,1), "cm"), panel.grid.minor = element_line(size = 1), panel.grid.major = element_line(size = 1), panel.border = element_blank(), axis.ticks.x = element_blank()) + ylab("Percentage of up in certain subtype(s)") #patterns as X-axis DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)]) DF2 <- melt(DF2, id.var="Rank") p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) + geom_bar(stat = "identity") + scale_fill_brewer(palette="Set2") + theme(line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), title = element_blank(), panel.background = element_rect(fill = "white"), legend.position="bottom", legend.key.size = unit(1.2,"line"), plot.margin=unit(c(-0.2,1,1,1), "cm")) + scale_y_reverse() + guides(fill = guide_legend(nrow = 1, byrow = TRUE)) g1 <- ggplotGrob(p1) g2 <- ggplotGrob(p2) colnames(g1) <- paste0(seq_len(ncol(g1))) colnames(g2) <- paste0(seq_len(ncol(g2))) g <- gtable_combine(g1, g2, along=2) panels <- g$layout$t[grepl("panel", g$layout$name)] n1 <- length(g$heights[panels[1]]) n2 <- length(g$heights[panels[2]]) # assign new (relative) heights to the panels # notice the *4 here to get different heights g$heights[panels[1]] <- unit(n1*4,"null") g$heights[panels[2]] <- unit(n2,"null") return(g) } grid.newpage() grid.draw(gridpatterns(ppnull1)) #or grid.draw(gridpatterns(rtest1)) ``` # Notes Default variance estimator in `OVESEG` is empirical Bayes moderated variance estimator used in `r Biocpkg("limma")` (argument `alpha="moderated"`). Another option is adding a constant $\alpha$ to pooled variance estimator (argument `alpha=`$\alpha$). Setting argument `alpha=NULL` will treat all variances as the same constant. Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control. ```{r, echo=TRUE} ##multiple comparison correction pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni") pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni") ##fdr control qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue", plot=FALSE, verbose=FALSE)$qval qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue", plot=FALSE, verbose=FALSE)$qval ``` # sessionInfo ```{r, echo=TRUE} sessionInfo() ``` # References