---
title: Using BumpyMatrix objects
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: BumpyMatrix
date: "Revised: December 15, 2020"
output:
  BiocStyle::html_document:
    toc_float: yes
vignette: >
  %\VignetteIndexEntry{The BumpyMatrix class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)
set.seed(0)
```

# Overview

The `BumpyMatrix` class is a two-dimensional object where each entry contains a non-scalar object of constant type/class but variable length.
This can be considered to be raggedness in the third dimension, i.e., "bumpiness".
The `BumpyMatrix` is intended to represent complex data that has zero-to-many mappings between individual data points and each feature/sample,
allowing us to store it in Bioconductor's standard 2-dimensional containers such as the `SummarizedExperiment`.
One example could be to store transcript coordinates for highly multiplexed FISH data;
the dimensions of the `BumpyMatrix` can represent genes and cells while each entry is a data frame with the relevant x/y coordinates.

# Construction

A variety of `BumpyMatrix` subclasses are implemented but the most interesting is probably the `BumpyDataFrameMatrix`.
This is an S4 matrix class where each entry is a `DataFrame` object, i.e., Bioconductor's wrapper around the `data.frame`.
To demonstrate, let's mock up some data for our hypothetical FISH experiment:

```{r}
library(S4Vectors)
df <- DataFrame(
    x=rnorm(10000), y=rnorm(10000), 
    gene=paste0("GENE_", sample(100, 10000, replace=TRUE)),
    cell=paste0("CELL_", sample(20, 10000, replace=TRUE))
)
df 
```

We then use the `splitAsBumpyMatrix()` utility to create our `BumpyDataFrameMatrix`, splitting the observations according to the variables that define the rows and columns of the matrix.
Here, each row is a gene, each column is a cell, and each entry holds all coordinates for that gene/cell combination.

```{r}
library(BumpyMatrix)
mat <- splitAsBumpyMatrix(df[,c("x", "y")], row=df$gene, column=df$cell)
mat
mat[1,1][[1]]
```
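
To make the splitting behaviour easier to inspect, here is a minimal sketch on a much smaller dataset.
The `toy` and `toy.mat` objects, along with the gene and cell labels, are hypothetical and exist only for illustration.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))
dim(toy.mat)

# Gene A has two transcripts in cell c1 and none in cell c2,
# so the latter entry is an empty DataFrame.
toy.mat["A","c1"][[1]]
toy.mat["A","c2"][[1]]
```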

We can also set `sparse=TRUE` to use a more efficient sparse representation, which avoids explicit storage of empty `DataFrame`s.
This may be necessary for larger datasets as there is a limit of `r .Machine$integer.max` (non-empty) entries in each `BumpyMatrix`.

```{r}
chosen <- df[1:100,]
smat <- splitAsBumpyMatrix(chosen[,c("x", "y")], row=chosen$gene, 
    column=chosen$cell, sparse=TRUE)
smat
```

# Basic operations

The `BumpyMatrix` implements many of the standard matrix operations, e.g., `nrow()`, `dimnames()`, the combining methods and transposition.

```{r}
dim(mat)
dimnames(mat)
rbind(mat, mat)
cbind(mat, mat)
t(mat)
```
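
As a quick sanity check of these operations, the sketch below uses a small hypothetical dataset (`toy`, `toy.mat`) where the expected dimensions and entries are easy to verify by eye.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

# Combining doubles the number of rows or columns.
dim(rbind(toy.mat, toy.mat))
dim(cbind(toy.mat, toy.mat))

# Transposition swaps the row and column of each entry.
toy.t <- t(toy.mat)
toy.t["c1","B"][[1]] # same as toy.mat["B","c1"][[1]]
```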

Subsetting will yield a new `BumpyMatrix` object corresponding to the specified submatrix.
If the returned submatrix has a dimension of length 1 and `drop=TRUE`, the underlying `CompressedList` of values (in this case, the list of `DataFrame`s) is returned.

```{r}
mat[c("GENE_2", "GENE_20"),]
mat[,1:5]
mat["GENE_10",]
```
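
The effect of `drop` can be seen more directly by checking the class of the result.
The following is a small sketch on hypothetical toy data (`toy`, `toy.mat`), not part of the FISH example above.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

# Selecting a single row drops to a list with one element per column.
row.a <- toy.mat["A",]
is(row.a, "CompressedList")
length(row.a)

# With drop=FALSE, we retain a 1-by-2 BumpyMatrix instead.
kept <- toy.mat["A",,drop=FALSE]
is(kept, "BumpyMatrix")
dim(kept)
```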

For `BumpyDataFrameMatrix` objects, we have an additional third index that allows us to easily extract an individual column of each `DataFrame` into a new `BumpyMatrix`.
In the example below, we extract the x-coordinate into a new `BumpyNumericMatrix`:

```{r}
out.x <- mat[,,"x"]
out.x
out.x[,1]
```
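
A small sketch on hypothetical toy data (`toy`, `toy.mat`, `toy.x`) shows that the third index preserves the matrix dimensions while replacing each `DataFrame` with the values of the chosen column.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

toy.x <- toy.mat[,,"x"]
is(toy.x, "BumpyNumericMatrix")
dim(toy.x)

# Each entry is now a numeric vector of x-coordinates,
# in the same order as the rows of the original DataFrame.
toy.x["B","c2"][[1]]
```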

Common arithmetic and logical operations are already implemented for `BumpyNumericMatrix` objects.
Almost all of these operations will act on each entry of the input object (or corresponding entries, for multiple inputs) 
and produce a new `BumpyMatrix` of the appropriate type.

```{r}
pos <- out.x > 0
pos[,1]
shift <- 10 * out.x + 1
shift[,1]
out.y <- mat[,,"y"]
greater <- out.y > out.x
greater[,1]
diff <- out.y - out.x
diff[,1]
```
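
To confirm that these operations act entry by entry, we can repeat them on a small hypothetical dataset (`toy`, `toy.mat`) where the expected values are easy to work out by hand.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))
toy.x <- toy.mat[,,"x"]
toy.y <- toy.mat[,,"y"]

# Scalar arithmetic is applied to every value in every entry.
scaled <- toy.x * 10
scaled["A","c1"][[1]]

# Comparisons yield a logical vector of the same length for each entry.
above <- toy.x > 0.15
above["A","c1"][[1]]

# Operations on two matrices pair up the corresponding entries.
delta <- toy.y - toy.x
delta["B","c2"][[1]]
```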

# Advanced subsetting

When subsetting a `BumpyMatrix`, we can use another `BumpyMatrix` containing indexing information for each entry.
Consider the following code chunk:

```{r}
i <- mat[,,'x'] > 0 & mat[,,'y'] > 0
i
i[,1]
sub <- mat[i]
sub
sub[,1]
```

Here, `i` is a `BumpyLogicalMatrix` where each entry is a logical vector.
When we do `mat[i]`, we effectively loop over the corresponding entries of `mat` and `i`, using the latter to subset the `DataFrame` in the former.
This produces a new `BumpyDataFrameMatrix` containing, in this case, only the observations with positive x- and y-coordinates.
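
The same idea can be checked on a small hypothetical dataset (`toy`, `toy.mat`), where we can count by hand how many observations should survive the filter in each entry.
This is only a sketch to illustrate the behaviour described above.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

# Keep observations with x above 0.15 and y below 1.45;
# this retains the second, third and fourth rows of 'toy'.
keep <- toy.mat[,,"x"] > 0.15 & toy.mat[,,"y"] < 1.45
toy.sub <- toy.mat[keep]

# The matrix dimensions are unchanged, only the entries shrink.
dim(toy.sub)
toy.sub["A","c1"][[1]]
toy.sub["B","c2"][[1]]
```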

For `BumpyDataFrameMatrix` objects, subsetting to a single field in the third dimension will automatically drop to the type of the underlying column of the `DataFrame`.
This can be prevented with `drop=FALSE`, which preserves the `BumpyDataFrameMatrix` output:

```{r}
mat[,,'x']
mat[,,'x',drop=FALSE]
```

In situations where we want to drop the third dimension but not the first two dimensions (or vice versa), we use the `.dropk` argument.
Setting `.dropk=FALSE` will ensure that the third dimension is not dropped, as shown below:

```{r}
mat[1,1,'x']
mat[1,1,'x',.dropk=FALSE]
mat[1,1,'x',drop=FALSE]
mat[1,1,'x',.dropk=TRUE,drop=FALSE]
```
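
The four combinations are easier to compare when we inspect what each one returns.
The sketch below does this on hypothetical toy data (`toy`, `toy.mat`); the `d1` to `d4` names are likewise just for illustration.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

# All dimensions dropped: a list holding a numeric vector.
d1 <- toy.mat["A","c1","x"]
is.numeric(d1[[1]])

# First two dimensions dropped, third kept: a list holding a DataFrame.
d2 <- toy.mat["A","c1","x",.dropk=FALSE]
is(d2[[1]], "DataFrame")

# Nothing dropped: a 1-by-1 BumpyDataFrameMatrix.
d3 <- toy.mat["A","c1","x",drop=FALSE]
is(d3, "BumpyDataFrameMatrix")

# Only the third dimension dropped: a 1-by-1 BumpyNumericMatrix.
d4 <- toy.mat["A","c1","x",.dropk=TRUE,drop=FALSE]
is(d4, "BumpyNumericMatrix")
```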

Subset replacement is also supported, which is most useful for modifying specific fields in place:

```{r}
copy <- mat
copy[,,'x'] <- copy[,,'x'] * 20
copy[,1]
```
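
On a small hypothetical dataset (`toy`, `toy.mat`, `toy.copy`), we can confirm that the replacement only touches the chosen field and leaves the others as they were.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: five transcripts over two genes and two cells.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5), y=c(1.1, 1.2, 1.3, 1.4, 1.5))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

toy.copy <- toy.mat
toy.copy[,,"x"] <- toy.copy[,,"x"] * 20

# The x-coordinates are rescaled while the y-coordinates are untouched.
toy.copy["B","c2"][[1]]
```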

# Additional operations

Some additional statistical operations are also implemented that will usually produce an ordinary matrix.
Here, each entry corresponds to the statistic computed from the corresponding entry of the `BumpyMatrix`.

```{r}
mean(out.x)[1:5,1:5] # matrix
var(out.x)[1:5,1:5] # matrix
```
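
For a version that can be verified by hand, the sketch below uses hypothetical toy data (`toy`, `toy.mat`, `toy.x`) where every gene/cell combination holds exactly two values.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: the values 1 to 8, two per gene/cell combination.
toy <- DataFrame(x=as.numeric(1:8))
toy.mat <- splitAsBumpyMatrix(toy, row=rep(c("A", "B"), each=4),
    column=rep(c("c1", "c1", "c2", "c2"), 2))
toy.x <- toy.mat[,,"x"]

# Each statistic is an ordinary matrix with one value per entry,
# e.g., the entry for gene A in cell c1 contains the values 1 and 2.
toy.means <- mean(toy.x)
toy.means
toy.vars <- var(toy.x)
toy.vars
```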

The exception is for operations that naturally produce a vector of fixed length for each entry, in which case a matching 3-dimensional array is returned:

```{r}
quantile(out.x)[1:5,1:5,]
range(out.x)[1:5,1:5,]
```
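
The layout of the returned array is easiest to see on a small example.
In this sketch on hypothetical toy data (`toy`, `toy.mat`, `toy.x`), the first two dimensions match those of the `BumpyMatrix` and the third dimension holds the statistics for each entry.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data: the values 1 to 8, two per gene/cell combination.
toy <- DataFrame(x=as.numeric(1:8))
toy.mat <- splitAsBumpyMatrix(toy, row=rep(c("A", "B"), each=4),
    column=rep(c("c1", "c1", "c2", "c2"), 2))
toy.x <- toy.mat[,,"x"]

# Two statistics (minimum and maximum) per entry.
toy.range <- range(toy.x)
dim(toy.range)
toy.range[1,1,]

# Five statistics per entry with the default quantile probabilities.
toy.quant <- quantile(toy.x)
dim(toy.quant)
```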

Other operations may return another `BumpyMatrix` if the output length is variable:

```{r}
pmax(out.x, out.y) 
```

`BumpyCharacterMatrix` objects also have their own methods for `grep()`, `tolower()`, etc. to manipulate the strings in a convenient manner.
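
As a brief sketch, we can create a `BumpyCharacterMatrix` by extracting a character column from a `BumpyDataFrameMatrix`.
The toy data and its `label` field are hypothetical and only serve to demonstrate `tolower()`.

```{r}
library(S4Vectors)
library(BumpyMatrix)

# Hypothetical toy data with a character annotation for each transcript.
toy <- DataFrame(x=c(0.1, 0.2, 0.3, 0.4, 0.5),
    label=c("Nuc", "Cyto", "Nuc", "Cyto", "Nuc"))
toy.mat <- splitAsBumpyMatrix(toy, row=c("A", "A", "B", "B", "B"),
    column=c("c1", "c1", "c1", "c2", "c2"))

toy.chr <- toy.mat[,,"label"]
is(toy.chr, "BumpyCharacterMatrix")

# String operations are applied to every string in every entry.
toy.lower <- tolower(toy.chr)
toy.lower["A","c1"][[1]]
```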

# Session information {-}

```{r}
sessionInfo()
```