--- title: Matrix representations to support decomposition author: - name: Aaron Lun affiliation: Cancer Research UK Cambridge Institute, Cambridge, United Kingdom date: "Revised: 26 May 2019" output: BiocStyle::html_document: toc_float: true package: BiocSingular vignette: > %\VignetteIndexEntry{2. Matrix classes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide", message=FALSE} require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) ``` ```{r setup, echo=FALSE, message=FALSE} library(BiocSingular) set.seed(100) ``` # Overview `r Biocpkg("BiocSingular")` implements several useful `DelayedMatrix` backends for dealing with principal components analysis (PCA). This vignette aims to provide an overview of these classes and how they can be used in other packages to improve efficiency prior to or after PCA. # The `DeferredMatrix` class As previously discussed, we can defer the column centering and scaling of a matrix. The `DeferredMatrix` class provides a matrix representation that is equivalent to the output of `scale()`, but without actually performing the centering/scaling operations. This is useful when dealing with sparse matrix representations, as naive centering would result in loss of sparsity and increased memory use. ```{r} library(Matrix) a <- rsparsematrix(10000, 1000, density=0.01) object.size(a) centering <- rnorm(ncol(a)) scaling <- runif(ncol(a)) y <- DeferredMatrix(a, centering, scaling) y object.size(y) # 'a' plus the size of 'centering' and 'scaling'. ``` At first glance, this seems to be nothing more than a variant on a `DelayedMatrix`. However, the real advantage of the `DeferredMatrix` comes when performing matrix multiplication. Multiplication is applied to the underlying sparse matrix, and the effects of centering and scaling are applied on the result. This allows us to achieve much greater speed than `r Biocpkg("DelayedArray")`'s block-processing mechanism, which realizes blocks of the matrix into dense arrays prior to multiplication. ```{r} v <- matrix(runif(ncol(a)*2), ncol=2) system.time(X <- y %*% v) X ``` The cost of this speed is that the resulting matrix product is less numerically stable. Recovering the effect of centering involves the subtraction of two (potentially large) inner products, which increases the risk of catastrophic cancellation. Nonetheless, some reduction in accuracy may be worth the major speed-up when dense realization is avoided. Note that it is also possible to nest `DeferredMatrix` objects within each other. This allows users to center and scale on both dimensions, as shown below. ```{r} centering2 <- rnorm(nrow(a)) scaling2 <- runif(nrow(a)) y2 <- DeferredMatrix(t(a), centering2, scaling2) # centering on rows of 'a'. y3 <- DeferredMatrix(t(y2), centering, scaling) # centering on columns. y3 ``` Other operations will cause the `DeferredMatrix` to collapse gracefully into `DelayedMatrix`, leading to block processing for further calculations. # The `LowRankMatrix` class Once a PCA is performed, it is occasionally desirable to obtain a low-rank approximation of the input matrix by taking the cross-product of the rotation vectors and PC scores. Naively doing so results in the formation of a dense matrix of the same dimensions as the input. This may be prohibitively memory-consuming for a large data set. Instead, we can construct a `LowRankMatrix` class that mimics the output of the cross-product without actually computing it. ```{r} a <- rsparsematrix(10000, 1000, density=0.01) out <- runPCA(a, rank=5, BSPARAM=IrlbaParam(deferred=TRUE)) # deferring for speed. recon <- LowRankMatrix(out$rotation, out$x) recon ``` This is useful for convenient extraction of row- or column vectors without needing to manually perform a cross-product. A `LowRankMatrix` is thus directly interoperable with downstream procedures (e.g., for visualization) that expect a matrix of the same dimensionality as the input. ```{r} summary(recon[,1]) summary(recon[2,]) ``` Again, most operations will cause the `LowRankMatrix` to collapse gracefully into `DelayedMatrix` for further processing. # The `ResidualMatrix` class A common analysis in genomics applications involves regressing out uninteresting factors of variation prior to a PCA. However, doing so naively would discard aspects of the underlying matrix representation. The most obvious example is the loss of sparsity when a dense matrix of residuals is computed, increasing memory usage and compute time in downstream applications. The `ResidualMatrix` class provides an alternative to explicit calculation of the residuals. The constructor takes a matrix of input values and a design matrix, where residuals are conceptually computed by fitting the linear model to the columns of the input matrix. However, the actual calculation of the residuals is delayed until they are explictly required. ```{r} design <- model.matrix(~gl(5, 1000)) y0 <- rsparsematrix(nrow(design), 30000, 0.01) resids <- ResidualMatrix(y0, design) resids ``` In fact, matrix multiplication steps involving a `ResidualMatrix` do not even need to compute the residuals explicitly at all. This means that `ResidualMatrix` objects can be efficiently used in approximate PCA algorithms based on multiplication. ```{r} system.time(pc.out <- runPCA(resids, 10, BSPARAM=IrlbaParam())) ``` Other operations will cause the `ResidualMatrix` to collapse into `DelayedMatrix` for further processing. # Session information ```{r} sessionInfo() ```