--- title: "Overview of DelayedMatrixStats" author: "Peter Hickey" date: "Modified: 06 Feb 2021 Compiled: `r format(Sys.Date(), '%d %b %Y')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Overview of DelayedMatrixStats} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE, setup} knitr::opts_chunk$set(echo = TRUE, comment = "#>", collapse = TRUE, message = FALSE) library(BiocStyle) ``` # Overview `r Biocpkg("DelayedMatrixStats")` ports the `r CRANpkg("matrixStats")` API to work with *DelayedMatrix* objects from the `r Biocpkg("DelayedArray")` package. It provides high-performing functions operating on rows and columns of *DelayedMatrix* objects, including all subclasses such as *RleArray* (from the `r Biocpkg("DelayedArray")` package) and *HDF5Array* (from the `r Biocpkg("HDF5Array")`) as well as supporting all types of *seeds*, such as *matrix* (from the *base* package) and *Matrix* (from the `r CRANpkg("Matrix")` package). # How can DelayedMatrixStats help me? The `r Biocpkg("DelayedArray")` package allows developers to store array-like data using in-memory or on-disk representations (e.g., in HDF5 files) and provides a common and familiar array-like interface for interacting with these data. The `r Biocpkg("DelayedMatrixStats")` package is designed to make life easier for Bioconductor developers wanting to use `r Biocpkg("DelayedArray")` by providing a rich set of column-wise and row-wise summary functions. We briefly demonstrate and explain these two features using a simple example. We'll simulate some (unrealistic) RNA-seq read counts data from 10,000 genes and 20 samples and store it on disk as a *HDF5Array*: ```{r data_sim, message = FALSE} library(DelayedArray) x <- do.call(cbind, lapply(1:20, function(j) { rpois(n = 10000, lambda = sample(20:40, 10000, replace = TRUE)) })) colnames(x) <- paste0("S", 1:20) x <- realize(x, "HDF5Array") x ``` Suppose you wish to compute the standard deviation of the read counts for each gene. You might think to use `apply()` like in the following: ```{r apply} system.time(row_sds <- apply(x, 1, sd)) head(row_sds) ``` This works, but takes quite a while. Or perhaps you already know that the `r CRANpkg("matrixStats")` package provides a `rowSds()` function: ```{r matrixStats, error = TRUE} matrixStats::rowSds(x) ``` Unfortunately (and perhaps unsurprisingly) this doesn't work. `r CRANpkg("matrixStats")` is designed for use on in-memory *matrix* objects. Well, why don't we just first realize our data in-memory and then use `r CRANpkg("matrixStats")` ```{r realization} system.time(row_sds <- matrixStats::rowSds(as.matrix(x))) head(row_sds) ``` This works and is many times faster than the `apply()`-based approach! However, it rather defeats the purpose of using a *HDF5Array* for storing the data since we have to bring all the data into memory at once to compute the result. Instead, we can use `DelayedMatrixStats::rowSds()`, which has the speed benefits of `matrixStats::rowSds()`[^speed] but without having to load the entire data into memory at once[^block_size]: [^speed]: In fact, it currently uses `matrixStats::rowSds()` under the hood. [^block_size]: In this case, it loads blocks of data row-by-row. The amount of data loaded into memory at any one time is controlled by the *default block size* global setting; see `?DelayedArray::getAutoBlockSize` for details. Notably, if the data are small enough (and the default block size is large enough) then all the data is loaded as a single block, but this approach generalizes and still works when the data are too large to be loaded into memory in one block. ```{r DelayedMatrixStats} library(DelayedMatrixStats) system.time(row_sds <- rowSds(x)) head(row_sds) ``` Finally, by using `r Biocpkg("DelayedMatrixStats")` we can use the same code, (`colMedians(x)`) regardless of whether the input is an ordinary *matrix* or a *DelayedMatrix*. This is useful for packages wishing to support both types of objects, e.g., packages wanting to retain backward compatibility or during a transition period from *matrix*-based to *DelayeMatrix*-based objects. # Supported methods The initial release of `r Biocpkg("DelayedMatrixStats")` supports the complete column-wise and row-wise API `r CRANpkg("matrixStats")` API[^api]. Please see the `r CRANpkg("matrixStats")` vignette ([available online](https://cran.r-project.org/package=matrixStats/vignettes/matrixStats-methods.html)) for a summary these methods. The following table documents the API coverage and availability of ['seed-aware' methods](#seed_aware_methods) in the current version of `r Biocpkg("DelayedMatrixStats")`, where: - ✔ = Implemented in `r Biocpkg("DelayedMatrixStats")` - ☑️ = Implemented in [**DelayedArray**](http://bioconductor.org/packages/DelayedArray/) or [**sparseMatrixStats**](http://bioconductor.org/packages/sparseMatrixStats/) - ❌ = Not yet implemented [^api]: Some of the API is covered via inheritance to functionality in `r Biocpkg("DelayedArray")` ```{r API, echo = FALSE} matrixStats <- sort( c("colsum", "rowsum", grep("^(col|row)", getNamespaceExports("matrixStats"), value = TRUE))) sparseMatrixStats <- getNamespaceExports("sparseMatrixStats") DelayedMatrixStats <- getNamespaceExports("DelayedMatrixStats") DelayedArray <- getNamespaceExports("DelayedArray") api_df <- data.frame( Method = paste0("`", matrixStats, "()`"), `Block processing` = ifelse( matrixStats %in% DelayedMatrixStats, "✔", ifelse(matrixStats %in% c(DelayedArray, sparseMatrixStats), "☑️", "❌")), `_base::matrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "matrix_OR_array_OR_table_OR_numeric"), "✔", "❌"), `_Matrix::dgCMatrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "dgCMatrix"), "✔", "❌"), `_Matrix::lgCMatrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "lgCMatrix"), "✔", "❌"), `_DelayedArray::RleArray_ (_SolidRleArraySeed_) optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "SolidRleArraySeed"), "✔", "❌"), `_DelayedArray::RleArray_ (_ChunkedRleArraySeed_) optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "ChunkedRleArraySeed"), "✔", "❌"), `_HDF5Array::HDF5Matrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "HDF5ArraySeed"), "✔", "❌"), `_base::data.frame_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "data.frame"), "✔", "❌"), `_S4Vectors::DataFrame_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "DataFrame"), "✔", "❌"), check.names = FALSE) knitr::kable(api_df, row.names = FALSE) ``` # 'Seed-aware' methods {#seed_aware_methods} As well as offering a familiar API, `r Biocpkg("DelayedMatrixStats")` provides 'seed-aware' methods that are optimized for specific types of *DelayedMatrix* objects. To illustrate this idea, we will compare two ways of computing the column sums of a *DelayedMatrix* object: 1. The 'block-processing' strategy. This was developed in the `r Biocpkg("DelayedArray")` package and is available for all methods in the `r Biocpkg("DelayedMatrixStats")` through the `force_block_processing` argument 2. The 'seed-aware' strategy. This is implemented in the `r Biocpkg("DelayedMatrixStats")` and is optimized for both speed and memory but only for *DelayedMatrix* objects with certain types of *seed*. We will demonstrate this by computing the column sums matrices with 20,000 rows and 600 columns where the data have different structure and are stored in *DelayedMatrix* objects with different types of seed: - Dense data with values in $(0, 1)$ using an ordinary _matrix_ as the seed - Sparse data with values in $[0, 1)$, where $60\%$ are zeros, using a _dgCMatrix_, a sparse matrix representation from the `r CRANpkg("Matrix")` package, as the seed - Dense data in ${0, 1, \ldots, 100}$, where there are multiple runs of identical values, using a _RleArraySeed_ from the `r Biocpkg("DelayedArray")` package as the seed We use the `r CRANpkg("microbenchmark")` package to measure running time and the `r CRANpkg("profmem")` package to measure the total memory allocations of each method. In each case, the 'seed-aware' method is many times faster and allocates substantially lower total memory. ```{r benchmarking, message = FALSE, echo = TRUE, error = TRUE} library(DelayedMatrixStats) library(sparseMatrixStats) library(microbenchmark) library(profmem) set.seed(666) # ----------------------------------------------------------------------------- # Dense with values in (0, 1) # Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed # # Generate some data dense_matrix <- matrix(runif(20000 * 600), nrow = 20000, ncol = 600) # Benchmark dm_matrix <- DelayedArray(dense_matrix) class(seed(dm_matrix)) dm_matrix microbenchmark( block_processing = colSums2(dm_matrix, force_block_processing = TRUE), seed_aware = colSums2(dm_matrix), times = 10) total(profmem(colSums2(dm_matrix, force_block_processing = TRUE))) total(profmem(colSums2(dm_matrix))) # ----------------------------------------------------------------------------- # Sparse (60% zero) with values in (0, 1) # Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed # # Generate some data sparse_matrix <- dense_matrix zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix)) sparse_matrix[zero_idx] <- 0 # Benchmark dm_dgCMatrix <- DelayedArray(Matrix(sparse_matrix, sparse = TRUE)) class(seed(dm_dgCMatrix)) dm_dgCMatrix microbenchmark( block_processing = colSums2(dm_dgCMatrix, force_block_processing = TRUE), seed_aware = colSums2(dm_dgCMatrix), times = 10) total(profmem(colSums2(dm_dgCMatrix, force_block_processing = TRUE))) total(profmem(colSums2(dm_dgCMatrix))) # ----------------------------------------------------------------------------- # Dense with values in {0, 100} featuring runs of identical values # Fast, memory-efficient column sums of DelayedMatrix with Rle-based seed # # Generate some data runs <- rep(sample(100, 500000, replace = TRUE), rpois(500000, 100)) runs <- runs[seq_len(20000 * 600)] runs_matrix <- matrix(runs, nrow = 20000, ncol = 600) # Benchmark dm_rle <- RleArray(Rle(runs), dim = c(20000, 600)) class(seed(dm_rle)) dm_rle microbenchmark( block_processing = colSums2(dm_rle, force_block_processing = TRUE), seed_aware = colSums2(dm_rle), times = 10) total(profmem(colSums2(dm_rle, force_block_processing = TRUE))) total(profmem(colSums2(dm_rle))) ``` The development of 'seed-aware' methods is ongoing work (see the [Roadmap](#roadmap)), and for now only a few methods and seed-types have a 'seed-aware' method. An extensive set of benchmarks is under development at [http://peterhickey.org/BenchmarkingDelayedMatrixStats/](http://peterhickey.org/BenchmarkingDelayedMatrixStats/). # Delayed operations A key feature of a _DelayedArray_ is the ability to register 'delayed operations'. For example, let's compute `sin(dm_matrix)`: ```{r sin} system.time(sin_dm_matrix <- sin(dm_matrix)) ``` This instantaneous because the operation is not actually performed, rather it is registered and only performed when the object is _realized_. All methods in `r Biocpkg("DelayedMatrixStats")` will correctly realise these delayed operations before computing the final result. For example, let's compute `colSums2(sin_dm_matrix)` and compare check we get the correct answer: ```{r colSums2_sin} all.equal(colSums2(sin_dm_matrix), colSums(sin(as.matrix(dm_matrix)))) ``` # Roadmap {#roadmap} The initial version of `r Biocpkg("DelayedMatrixStats")` provides complete coverage of the `r CRANpkg("matrixStats")` column-wise and row-wise API[^api], allowing package developers to use these functions with _DelayedMatrix_ objects as well as with ordinary _matrix_ objects. This should simplify package development and assist authors to support to their software for large datasets stored in disk-backed data structures such as _HDF5Array_. Such large datasets are increasingly common with the rise of single-cell genomics. Future releases of `r Biocpkg("DelayedMatrixStats")` will improve the performance of these methods, specifically by developing additional 'seed-aware' methods. The plan is to prioritise commonly used methods (e.g., `colMeans2()`/`rowMeans2()`, `colSums2()`/`rowSums2()`, etc.) and the development of 'seed-aware' methods for the _HDF5Matrix_ class. To do so, we will leverage the `r Biocpkg("beachmat")` package. Proof-of-concept code has shown that this can greatly increase the performance when analysing such disk-backed data. Importantly, all package developers using methods from `r Biocpkg("DelayedMatrixStats")` will immediately gain from performance improvements to these low-level routines. By using `r Biocpkg("DelayedMatrixStats")`, package developers will be able to focus on higher level programming tasks and address important scientific questions and technological challenges in high-throughput biology.