--- title: "Effect size estimation with apeglm" date: "`r format(Sys.Date(), '%m/%d/%Y')`" author: "Anqi Zhu, Joseph G. Ibrahim, and Michael I. Love" output: rmarkdown::html_document: highlight: pygments toc: true toc_float: true abstract: | *apeglm* provides empirical Bayes shrinkage estimators for effect sizes for a variety of GLM models; *apeglm* stands for "Approximate Posterior Estimation for GLM". apeglm package version: `r packageVersion("apeglm")` bibliography: library.bib vignette: | %\VignetteIndexEntry{Effect size estimation with apeglm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(tidy=FALSE, cache=FALSE, dev="png", message=FALSE, error=FALSE, warning=TRUE) ``` # Typical RNA-seq call from DESeq2 **Note:** the typical RNA-seq workflow for users would be to call *apeglm* estimation from within the `lfcShrink` function from the *DESeq2* package. The unevaluated code chunk shows how to obtain *apeglm* shrinkage estimates after running `DESeq`. See the DESeq2 vignette for more details. The `lfcShrink` wrapper function takes care of many details below, and unifies the interface for multiple shrinkage estimators. The coefficient to shrink can be specified either by name or by number (following the order in `resultsNames(dds)`). Be aware that *DESeq2*'s `lfcShrink` interface provides LFCs on the log2 scale, while *apeglm* provides coefficients on the natural log scale. ```{r,eval=FALSE} res <- lfcShrink(dds, coef=2, type="apeglm") ``` # Acknowledgments Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette. We have benefited in the development of `apeglm` from feedback or contributions from the following individuals: Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson # Example RNA-seq analysis Here we show example code which mimics what will happen inside the `lfcShrink` function when using the *apeglm* method [@Zhu2018]. Load a prepared `SummarizedExperiment`: ```{r} library(airway) data(airway) head(assay(airway)) ``` For demonstration, we will use 5000 genes of the `airway` dataset, the first from those genes with at least 10 counts across all samples. ```{r} keep <- head(which(rowSums(assay(airway)) > 0), 5000) airway <- airway[keep,] ``` First run a *DESeq2* differential expression analysis [@Love2014] (size factors, and dispersion estimates could similarly be estimated using *edgeR*): ```{r} library(DESeq2) dds <- DESeqDataSet(airway, ~cell + dex) dds$dex <- relevel(dds$dex, "untrt") dds <- DESeq(dds) res <- results(dds) ``` Defining data and parameter objects necessary for `apeglm`. We must multiply the coefficients from *DESeq2* by a factor, because *apeglm* provides natural log coefficients. Again, this would be handled inside of `lfcShrink` in *DESeq2* for a typical RNA-seq analysis. ```{r} x <- model.matrix(design(dds), colData(dds)) param <- dispersions(dds) mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE) offset <- matrix(log(sizeFactors(dds)), ncol=ncol(dds), nrow=nrow(dds),byrow=TRUE) ``` # Running apeglm Here `apeglm` on 5000 genes takes less than a minute on a laptop. It scales with number of genes, the number of samples and the number of variables in the design formula, where here we have 5 coefficients (one for the four cell cultures and one for the difference due to dexamethasone treatment). We provide `apeglm` with the *SummarizedExperiment* although the function can also run on a *matrix* of counts or other observed data. We specify a `coef` as well as a `threshold` which we discuss below. Note that we multiple the `threshold` by `log(2)` to convert from log2 scale to natural log scale. ```{r} library(apeglm) system.time({ fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset) }) str(fit$prior.control) ``` There are faster implementations of apeglm specifically for negative binomial likelihoods. The version `nbinomR` is ~5 times faster than the default `general`. ```{r} system.time({ fitR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR") }) ``` The version `nbinomCR` is ~10 times faster than the default `general`. ```{r} system.time({ fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR") }) ``` The version `nbinomC` returns only the MAP coefficients and can be ~50-100 times faster than the default `general`. The MAP coefficients are the same as returned by `nbinomCR` above, we just skip the calculation of posterior SD. A variant of `nbinomC` is `nbinomC*` which includes random starts. ```{r} system.time({ fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC") }) ``` Among other output, we have the estimated coefficients attached to the ranges of the *SummarizedExperiment* used as input: ```{r} class(fit$ranges) mcols(fit$ranges, use.names=TRUE) ``` We can compare the coefficients from *apeglm* with those from *DESeq2*. *apeglm* provides coefficients on the natural log scale, so we must convert to log2 scale by multiplying by `log2(exp(1))`. Note that *DESeq2*'s `lfcShrink` function converts *apeglm* coefficients to the log2 scale internally. ```{r} system.time({ res.shr <- lfcShrink(dds, coef=5) }) DESeq2.lfc <- res.shr$log2FoldChange apeglm.lfc <- log2(exp(1)) * fit$map[,5] ``` Here we plot *apeglm* estimators against *DESeq2*: ```{r apevsdeseq} plot(DESeq2.lfc, apeglm.lfc) abline(0,1) ``` Here we plot *MLE*, *DESeq2* and *apeglm* estimators against the mean of normalized counts: ```{r maplots, fig.width=9, fig.height=3.5} par(mfrow=c(1,3)) lims <- c(-8,8) hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2)) xlab <- "mean of normalized counts" plot(res$baseMean, res$log2FoldChange, log="x", ylim=lims, main="MLE", xlab=xlab) hline() plot(res$baseMean, DESeq2.lfc, log="x", ylim=lims, main="DESeq2", xlab=xlab) hline() plot(res$baseMean, apeglm.lfc, log="x", ylim=lims, main="apeglm", xlab=xlab) hline() ``` # Specific coefficients Note that p-values and FSR define different events, and are not on the same scale. An FSR of 0.5 means that the estimated sign is as bad as random guess. ```{r pvalcomp, fig.width=8, fig.height=4} par(mfrow=c(1,2),mar=c(5,5,1,1)) plot(res$pvalue, fit$fsr, col="blue", xlab="DESeq2 pvalue", ylab="apeglm local FSR", xlim=c(0,1), ylim=c(0,.5)) abline(0,1) plot(-log10(res$pvalue), -log10(fit$fsr), xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR", col="blue") abline(0,1) ``` The s-value was proposed by @Stephens2016, as a statistic giving the aggregate false sign rate for tests with equal or lower s-value than the one considered. We recommend using a lower threshold on s-values than typically used for adjusted p-values, for example one might be interested in sets with 0.01 or 0.005 aggregate FSR. ```{r pvalcomp2} plot(res$padj, fit$svalue, col="blue", xlab="DESeq2 padj", ylab="apeglm svalue", xlim=c(0,.2), ylim=c(0,.02)) ``` More scrutiny can be applied by using an LFC threshold greater than zero, and asking for the probability of a "false-sign-or-small" (FSOS) event: that the effect size is not further from zero in distance than the threshold amount. We can run the `svalue` function on these per-gene probabilities to produce s-values that bound the FSOS rate for sets of genes. By specifying `threshold=log(2) * 1` above, `apeglm` will then output a vector `thresh` in the results list that gives the per-gene probabilities of false-sign-or-small events. ```{r maplotthresh} s.val <- svalue(fit$thresh) cols <- ifelse(s.val < .01, "red", "black") plot(res$baseMean, log2(exp(1)) * fit$map[,5], log="x", col=cols, xlab=xlab, ylab="LFC") abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2) ``` # Modeling zero-inflated counts We have created a separate GitHub repository giving an example of how the *apeglm* estimator can be used for Zero-Inflated Negative Binomial data. This approach uses the *zinbwave* method and Bioconductor package to estimate the probability of each zero belonging to the excess zero component. We compare using a Negative Binomial likelihood with the excess zeros down-weighted and using a Zero-Inflated Negative Binomial likelihood. These two approaches with *apeglm* perform similarly but we note that the first approach involves less additional code and is faster to compute. # Modeling ratios of counts We also show an short example using an alternative likelihood to the negative binomial. Suppose we have allele-specific counts for n=20 vs 20 samples across 5000 genes. We can define a binomial model and test for the allelic balance across groups of samples. Here we will *simulate* allele counts from our existing dataset for demonstration. We spike in 10 genes with strong allelic imbalance (instead of an allelic ratio close to 0.5, these will have a ratio of 0.75). ```{r} library(emdbook) n <- 20 f <- factor(rep(1:2,each=n)) mu <- ifelse(res$baseMean > 50, res$baseMean, 50) set.seed(1) cts <- matrix(rnbinom(nrow(dds)*2*n, mu=mu, size=1/dispersions(dds)), ncol=2*n) theta <- runif(nrow(cts),1,1000) prob <- rnorm(nrow(cts),.5,.05) # close to 0.5 ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob, size=cts, theta=rep(theta,ncol(cts))), nrow=nrow(cts)) idx <- 1:10 idx2 <- which(f == 2) theta[idx] <- 1000 prob[idx] <- 0.75 # the spiked in genes have an allelic ratio of 0.75 ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=prob[idx], size=cts[idx,idx2], theta=theta[idx]), nrow=length(idx)) ``` We first need to estimate MLE coefficients and standard errors. One option to run `apeglm` would be to define a beta-binomial likelihood function which uses the total counts as a parameter, and the logit function as a link. And then this function could be provided to the `log.lik` argument. ```{r} betabinom.log.lik <- function(y, x, beta, param, offset) { xbeta <- x %*% beta p.hat <- (1+exp(-xbeta))^-1 dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE) } ``` However, `apeglm` has faster C++ implementations for the beta-binomial, which were implemented and tested by Joshua Zitovsky. Here, using `method="betabinCR"` is 3 times faster than using the R-defined `log.lik`, and the `"betabinCR"` method also scales significantly better with more samples and more coefficients. As with the negative binomial, `method="betabinC"` can be used if the standard errors are not needed, and this will be 5 times faster than the R-defined `log.lik` approach on this dataset. The following code performs two iterations of estimating the MLE coefficients, and then estimating the beta-binomial dispersion. ```{r} theta.hat <- 100 # rough initial estimate of dispersion x <- model.matrix(~f) niter <- 2 for (i in 1:niter) { param <- cbind(theta.hat, cts) system.time({ fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param, no.shrink=TRUE, log.link=FALSE, method="betabinCR") }) theta.hat <- bbEstDisp(success=ase.cts, size=cts, x=x, beta=fit.mle$map, minDisp=1, maxDisp=500) } ``` We can then plot the MLE estimates over the mean: ```{r mlemaplot} coef <- 2 xlab <- "mean of normalized counts" plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="log odds") points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3) ``` Now we run the posterior estimation, including a prior on the second coefficient: ```{r} mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef]) param <- cbind(theta.hat, cts) system.time({ fit2 <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param, coef=coef, mle=mle, threshold=0.5, log.link=FALSE, method="betabinCR") }) ``` In the `apeglm` plot, we color in red the genes with a low aggregate probability of false-sign-or-small (FSOS) events (s-value < .01), where we've again defined "small" on the log odds scale using the `threshold` argument above. ```{r asemaplot, fig.width=8, fig.height=4} par(mfrow=c(1,2)) ylim <- c(-1,1.5) s.val <- svalue(fit2$thresh) # small-or-false-sign value plot(res$baseMean, fit.mle$map[,coef], main="MLE", log="x", xlab=xlab, ylab="log odds", ylim=ylim) points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5)) cols <- ifelse(s.val < .01, "red", "black") plot(res$baseMean, fit2$map[,coef], main="apeglm", log="x", xlab=xlab, ylab="log odds", col=cols, ylim=ylim) points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5)) ``` ```{r} logit <- function(x) log(x/(1-x)) logit(.75) table(abs(logit(prob[s.val < .01])) > .5) ``` # Session Info ```{r} sessionInfo() ``` # References