--- title: "Confident fold change" author: "Paul Harrison" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Confident fold change} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: cache: !r FALSE --- This document shows typical Topconfects usage with limma, edgeR, or DESeq2. The first step is to load a dataset. Here, we're looking at RNA-seq data that investigates the response of *Arabodopsis thaliana* to a bacterial pathogen. Besides the experimental and control conditions, there is also a batch effect. This dataset is also examined in section 4.2 of the `edgeR` user manual, and I've followed the initial filtering steps in the `edgeR` manual. ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width=7, fig.height=6) ``` ```{r load, message=FALSE, warning=FALSE} library(topconfects) library(NBPSeq) library(edgeR) library(limma) library(dplyr) library(ggplot2) data(arab) # Retrieve symbol and description for each gene info <- AnnotationDbi::select( org.At.tair.db::org.At.tair.db, rownames(arab), c("SYMBOL","GENENAME")) %>% group_by(TAIR) %>% summarize( SYMBOL=paste(unique(na.omit(SYMBOL)),collapse="/"), GENENAME=paste(unique(na.omit(GENENAME)),collapse="/")) arab_info <- info[match(rownames(arab),info$TAIR),] %>% select(-TAIR) %>% as.data.frame rownames(arab_info) <- rownames(arab) # Extract experimental design from sample names Treat <- factor(substring(colnames(arab),1,4)) %>% relevel(ref="mock") Time <- factor(substring(colnames(arab),5,5)) y <- DGEList(arab, genes=as.data.frame(arab_info)) # Keep genes with at least 3 samples having an RPM of more than 2 keep <- rowSums(cpm(y)>2) >= 3 y <- y[keep,,keep.lib.sizes=FALSE] y <- calcNormFactors(y) ``` ## limma analysis ### Standard limma analysis steps ```{r limma} design <- model.matrix(~Time+Treat) design[,] fit <- voom(y, design) %>% lmFit(design) ``` ### Apply topconfects Find largest fold changes that we can be confident of at FDR 0.05. ```{r limma_confects} confects <- limma_confects(fit, coef=4, fdr=0.05) confects ``` ### Looking at the result Here the usual `logFC` values estimated by `limma` are shown as dots, with lines to the `confect` value. ```{r fig.height=7} confects_plot(confects) ``` `confects_plot_me` overlays the confects (black) on a Mean-Difference Plot (grey) (as might be produced by `limma::plotMD`). As we should expect, the very noisy differences with low mean expression are removed if we look at the confects. ```{r} confects_plot_me(confects) ``` Let's compare this to the ranking we obtain from `topTable`. ```{r, fig.height=7} fit_eb <- eBayes(fit) top <- topTable(fit_eb, coef=4, n=Inf) rank_rank_plot(confects$table$name, rownames(top), "limma_confects", "topTable") ``` You can see that the top 19 genes from topTable are all within the top 40 for topconfects ranking, but topconfects has also highly ranked some other genes. These have a large effect size, and sufficient if not overwhelming evidence of this. An MD-plot highlighting the positions of the top 40 genes in both rankings also illustrates the differences between these two ways of ranking genes. ```{r} plotMD(fit, legend="bottomleft", status=paste0( ifelse(rownames(fit) %in% rownames(top)[1:40], "topTable ",""), ifelse(rownames(fit) %in% confects$table$name[1:40], "confects ",""))) ``` ## edgeR analysis An analysis in edgeR produces similar results. Note that only quasi-likelihood testing from edgeR is supported. ### Standard edgeR analysis ```{r edger} y <- estimateDisp(y, design, robust=TRUE) efit <- glmQLFit(y, design, robust=TRUE) ``` ### Apply topconfects A step of 0.05 is used here merely so that the vignette will build quickly. `edger_confects` calls `edgeR::glmTreat` repeatedly, which is necessarily slow. In practice a smaller value such as 0.01 should be used. ```{r edger_confects} econfects <- edger_confects(efit, coef=4, fdr=0.05, step=0.05) econfects ``` ### Looking at the result ```{r fig.height=7} confects_plot(econfects) confects_plot_me(econfects) ``` ```{r} etop <- glmQLFTest(efit, coef=4) %>% topTags(n=Inf) plotMD(efit, legend="bottomleft", status=paste0( ifelse(rownames(efit) %in% econfects$table$name[1:40], "confects ", ""), ifelse(rownames(efit) %in% rownames(etop)[1:40], "topTags ",""))) ``` ## DESeq2 analysis DESeq2 does its own filtering of lowly expressed genes, so we start from the original count matrix. The initial steps are as for a normal DESeq2 analysis. ```{r message=F, warning=F} library(DESeq2) dds <- DESeqDataSetFromMatrix( countData = arab, colData = data.frame(Time, Treat), rowData = arab_info, design = ~Time+Treat) dds <- DESeq(dds) ``` ### Apply topconfects The contrast or coefficient to test is specified as in the `DESeq2::results` function. The step of 0.05 is merely so that this vignette will build quickly, in practice a smaller value such as 0.01 should be used. `deseq2_confects` calls `results` repeatedly, and in fairness `results` has not been optimized for this. ```{r} dconfects <- deseq2_confects(dds, name="Treat_hrcc_vs_mock", step=0.05) ``` DESeq2 offers shrunken estimates of LFC. This is another sensible way of ranking genes. Let's compare them to the confect values. ```{r} shrunk <- lfcShrink(dds, coef="Treat_hrcc_vs_mock", type="ashr") dconfects$table$shrunk <- shrunk$log2FoldChange[dconfects$table$index] dconfects ``` DESeq2 filters some genes, these are placed last in the table. If your intention is to obtain a ranking of all genes, you should disable this with `deseq2_confects(..., cooksCutoff=Inf, independentFiltering=FALSE)`. ```{r} table(dconfects$table$filtered) tail(dconfects$table) ``` ### Looking at the result Shrunk LFC estimates are shown in red. ```{r fig.height=7} confects_plot(dconfects) + geom_point(aes(x=shrunk, size=baseMean, color="lfcShrink"), alpha=0.75) ``` `lfcShrink` aims for a best estimate of the LFC, whereas confect is a conservative estimate. `lfcShrink` can produce non-zero values for genes which can't be said to significantly differ from zero -- it doesn't do double duty as an indication of significance -- whereas the confect value will be `NA` in this case. The plot below compares these two quantities. Only un-filtered genes are shown (see above). ```{r} filter(dconfects$table, !filtered) %>% ggplot(aes( x=ifelse(is.na(confect),0,confect), y=shrunk, color=!is.na(confect))) + geom_point() + geom_abline() + coord_fixed() + theme_bw() + labs(color="Significantly\nnon-zero at\nFDR 0.05", x="confect", y="lfcShrink using ashr") ``` ## Comparing results ```{r fig.height=7} rank_rank_plot(confects$table$name, econfects$table$name, "limma confects", "edgeR confects") rank_rank_plot(confects$table$name, dconfects$table$name, "limma confects", "DESeq2 confects") rank_rank_plot(econfects$table$name, dconfects$table$name, "edgeR confects", "DESeq2 confects") ``` --- ```{r} sessionInfo() ```