The HighlyReplicatedRNASeq package provides functions to access the count matrix from bulk RNA-seq studies with many replicates. For example,the study from Schurch et al. (2016) has data on 86 samples of S. cerevisiae in two conditions.

1 Load Data

To load the dataset, call the Schurch16() function. It returns a SummarizedExperiment:

schurch_se <- HighlyReplicatedRNASeq::Schurch16()
#> snapshotDate(): 2022-04-19
#> see ?HighlyReplicatedRNASeq and browseVignettes('HighlyReplicatedRNASeq') for documentation
#> loading from cache
#> see ?HighlyReplicatedRNASeq and browseVignettes('HighlyReplicatedRNASeq') for documentation
#> loading from cache

schurch_se
#> class: SummarizedExperiment 
#> dim: 7126 86 
#> metadata(0):
#> assays(1): counts
#> rownames(7126): 15S_rRNA 21S_rRNA ... tY(GUA)O tY(GUA)Q
#> rowData names(0):
#> colnames(86): wildtype_01 wildtype_02 ... knockout_47 knockout_48
#> colData names(4): file_name condition replicate name

An alternative approach that achieves exactly the same is to load the data directly from ExperimentHub

library(ExperimentHub)
eh <- ExperimentHub()
#> snapshotDate(): 2022-04-19
records <- query(eh, "HighlyReplicatedRNASeq")
records[1]           ## display the metadata for the first resource
#> ExperimentHub with 1 record
#> # snapshotDate(): 2022-04-19
#> # names(): EH3315
#> # package(): HighlyReplicatedRNASeq
#> # $dataprovider: Geoff Barton's group on GitHub
#> # $species: Saccharomyces cerevisiae BY4741
#> # $rdataclass: matrix
#> # $rdatadateadded: 2020-04-03
#> # $title: Schurch S. cerevesiae Highly Replicated Bulk RNA-Seq Counts
#> # $description: Count matrix for bulk RNA-sequencing dataset from 86 S. cere...
#> # $taxonomyid: 1247190
#> # $genome: Ensembl release 68
#> # $sourcetype: tar.gz
#> # $sourceurl: https://github.com/bartongroup/profDGE48
#> # $sourcesize: NA
#> # $tags: c("ExperimentHub", "ExperimentData", "ExpressionData",
#> #   "SequencingData", "RNASeqData") 
#> # retrieve record with 'object[["EH3315"]]'
count_matrix <- records[["EH3315"]]  ## load the count matrix by ID
#> see ?HighlyReplicatedRNASeq and browseVignettes('HighlyReplicatedRNASeq') for documentation
#> loading from cache
count_matrix[1:10, 1:5]
#>          wildtype_01 wildtype_02 wildtype_03 wildtype_04 wildtype_05
#> 15S_rRNA           2          12          31           8          21
#> 21S_rRNA          20          76         101          99         128
#> HRA1               3           2           2           2           3
#> ICR1              75         123         107         157          98
#> LSR1              60         163         233         163         193
#> NME1              13          14          23          13          29
#> PWR1               0           0           0           0           0
#> Q0010              0           0           0           0           0
#> Q0017              0           0           0           0           0
#> Q0032              0           0           0           0           0

2 Explore Data

It has 7126 genes and 86 samples. The counts are between 0 and 467,000.

summary(c(assay(schurch_se, "counts")))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0      89     386    1229     924  467550

To make the data easier to work with, I will “normalize” the data. First I divide it by the mean of each sample to account for the differential sequencing depth. Then, I apply the log() transformation and add a small number to avoid taking the logarithm of 0.

norm_counts <- assay(schurch_se, "counts")
norm_counts <- log(norm_counts / colMeans(norm_counts) + 0.001)

The histogram of the transformed data looks very smooth:

hist(norm_counts, breaks = 100)

Finally, let us take a look at the MA-plot of the data and the volcano plot:

wt_mean <- rowMeans(norm_counts[, schurch_se$condition == "wildtype"])
ko_mean <- rowMeans(norm_counts[, schurch_se$condition == "knockout"])

plot((wt_mean+ ko_mean) / 2, wt_mean - ko_mean,
     pch = 16, cex = 0.4, col = "#00000050", frame.plot = FALSE)
abline(h = 0)


pvalues <- sapply(seq_len(nrow(norm_counts)), function(idx){
  tryCatch(
    t.test(norm_counts[idx, schurch_se$condition == "wildtype"], 
         norm_counts[idx, schurch_se$condition == "knockout"])$p.value,
  error = function(err) NA)
})

plot(wt_mean - ko_mean, - log10(pvalues),
     pch = 16, cex = 0.5, col = "#00000050", frame.plot = FALSE)

3 Reference

Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? Rna, 22(6), 839–851. https://doi.org/10.1261/rna.053959.115

4 Session Info

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] HighlyReplicatedRNASeq_1.8.0 ExperimentHub_2.4.0         
#>  [3] AnnotationHub_3.4.0          BiocFileCache_2.4.0         
#>  [5] dbplyr_2.1.1                 SummarizedExperiment_1.26.0 
#>  [7] Biobase_2.56.0               GenomicRanges_1.48.0        
#>  [9] GenomeInfoDb_1.32.0          IRanges_2.30.0              
#> [11] S4Vectors_0.34.0             BiocGenerics_0.42.0         
#> [13] MatrixGenerics_1.8.0         matrixStats_0.62.0          
#> [15] BiocStyle_2.24.0            
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2                    sass_0.4.1                   
#>  [3] bit64_4.0.5                   jsonlite_1.8.0               
#>  [5] bslib_0.3.1                   shiny_1.7.1                  
#>  [7] assertthat_0.2.1              interactiveDisplayBase_1.34.0
#>  [9] highr_0.9                     BiocManager_1.30.17          
#> [11] blob_1.2.3                    GenomeInfoDbData_1.2.8       
#> [13] yaml_2.3.5                    BiocVersion_3.15.2           
#> [15] lattice_0.20-45               pillar_1.7.0                 
#> [17] RSQLite_2.2.12                glue_1.6.2                   
#> [19] digest_0.6.29                 promises_1.2.0.1             
#> [21] XVector_0.36.0                htmltools_0.5.2              
#> [23] httpuv_1.6.5                  Matrix_1.4-1                 
#> [25] pkgconfig_2.0.3               magick_2.7.3                 
#> [27] bookdown_0.26                 zlibbioc_1.42.0              
#> [29] purrr_0.3.4                   xtable_1.8-4                 
#> [31] later_1.3.0                   tibble_3.1.6                 
#> [33] KEGGREST_1.36.0               generics_0.1.2               
#> [35] ellipsis_0.3.2                withr_2.5.0                  
#> [37] cachem_1.0.6                  cli_3.3.0                    
#> [39] magrittr_2.0.3                crayon_1.5.1                 
#> [41] mime_0.12                     memoise_2.0.1                
#> [43] evaluate_0.15                 fansi_1.0.3                  
#> [45] tools_4.2.0                   lifecycle_1.0.1              
#> [47] stringr_1.4.0                 DelayedArray_0.22.0          
#> [49] AnnotationDbi_1.58.0          Biostrings_2.64.0            
#> [51] compiler_4.2.0                jquerylib_0.1.4              
#> [53] rlang_1.0.2                   grid_4.2.0                   
#> [55] RCurl_1.98-1.6                rappdirs_0.3.3               
#> [57] bitops_1.0-7                  rmarkdown_2.14               
#> [59] DBI_1.1.2                     curl_4.3.2                   
#> [61] R6_2.5.1                      knitr_1.39                   
#> [63] dplyr_1.0.8                   fastmap_1.1.0                
#> [65] bit_4.0.4                     utf8_1.2.2                   
#> [67] filelock_1.0.2                stringi_1.7.6                
#> [69] Rcpp_1.0.8.3                  vctrs_0.4.1                  
#> [71] png_0.1-7                     tidyselect_1.1.2             
#> [73] xfun_0.30

HighlyReplicatedRNASeq

2022-04-28

1 Load Data

2 Explore Data

3 Reference

4 Session Info