--- title: "HighlyReplicatedRNASeq" author: Constantin Ahlmann-Eltze date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Exploring the 86 bulk RNA-seq samples from the Schurch et al. (2016) study} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The `r Biocpkg("HighlyReplicatedRNASeq")` package provides functions to access the count matrix from bulk RNA-seq studies with many replicates. For example,the study from Schurch et al. (2016) has data on 86 samples of S. *cerevisiae* in two conditions. # Load Data To load the dataset, call the `Schurch16()` function. It returns a `r Biocpkg("SummarizedExperiment")`: ```{r} schurch_se <- HighlyReplicatedRNASeq::Schurch16() schurch_se ``` An alternative approach that achieves exactly the same is to load the data directly from ExperimentHub ```{r} library(ExperimentHub) eh <- ExperimentHub() records <- query(eh, "HighlyReplicatedRNASeq") records[1] ## display the metadata for the first resource count_matrix <- records[["EH3315"]] ## load the count matrix by ID count_matrix[1:10, 1:5] ``` # Explore Data It has 7126 genes and 86 samples. The counts are between 0 and 467,000. ```{r} summary(c(assay(schurch_se, "counts"))) ``` To make the data easier to work with, I will "normalize" the data. First I divide it by the mean of each sample to account for the differential sequencing depth. Then, I apply the `log()` transformation and add a small number to avoid taking the logarithm of 0. ```{r} norm_counts <- assay(schurch_se, "counts") norm_counts <- log(norm_counts / colMeans(norm_counts) + 0.001) ``` The histogram of the transformed data looks very smooth: ```{r} hist(norm_counts, breaks = 100) ``` Finally, let us take a look at the MA-plot of the data and the volcano plot: ```{r} wt_mean <- rowMeans(norm_counts[, schurch_se$condition == "wildtype"]) ko_mean <- rowMeans(norm_counts[, schurch_se$condition == "knockout"]) plot((wt_mean+ ko_mean) / 2, wt_mean - ko_mean, pch = 16, cex = 0.4, col = "#00000050", frame.plot = FALSE) abline(h = 0) pvalues <- sapply(seq_len(nrow(norm_counts)), function(idx){ tryCatch( t.test(norm_counts[idx, schurch_se$condition == "wildtype"], norm_counts[idx, schurch_se$condition == "knockout"])$p.value, error = function(err) NA) }) plot(wt_mean - ko_mean, - log10(pvalues), pch = 16, cex = 0.5, col = "#00000050", frame.plot = FALSE) ``` # Reference * Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? Rna, 22(6), 839–851. https://doi.org/10.1261/rna.053959.115 # Session Info ```{r} sessionInfo() ```