---
title: "cmapR Tutorial"
author:
- name: "Ted Natoli"
  email: ted.e.natoli@gmail.com
date: "1/10/2020"
package: cmapR
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{cmapR Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8} 
abstract: >
  A tutorial on using the cmapR package to parse and maniplate data in various formats used by the Connectivity Map, mainly annotated matrices stored as GCT or GCTX.
  

---

# Introduction

This notebook will walk through some of the basic functionality in the cmapR package, which is largely centered around working with matrix data in [GCT and GCTX format](https://doi.org/10.1093/bioinformatics/bty784), commonly used by the [Connectivity Map (CMap) project](https://clue.io). 


# Installation

The cmapR package can be installed from Bioconductor by running the following commands in an R session.
```{r}
# TODO: fill in once package has been accepted on bioconductor
```
cmapR source code can be also obtained from [github](https://github.com/cmap/cmapR). 


# Loading the cmapR package

cmapR can be loaded within an R session or script just like any other package.
```{r}
library(cmapR)
```


# GCT objects in R

The main class of objects in `cmapR` is the `GCT` class. The `GCT` object contains a data `matrix` and (optionally) row and column annotations as `data.frame` objects. cmapR has comes with an example GCT object called `ds` (short for dataset). We can view its structure by simply typing its name.
```{r}
ds
```
`GCT` objects contain the following components, or `slots`.

* `mat` - the data matrix
* `rdesc` - a `data.frame` of row annotations, with one row per matrix row
* `cdesc` - a `data.frame` of column annotations, with one row per matrix column
* `rid` - a character vector of unique row identifiers
* `cid` - a character vector of unique column identifiers
* `src` - a character string indicating the source (usually a file path) of the data

## Accessing GCT object components

The components of a `GCT` object can be accessed or modified using a set of accessor functions.

```{r}
# access the data matrix
m <- mat(ds)

# access the row and column metadata
rdesc <- meta(ds, dimension = "row")
cdesc <- meta(ds, dimension = "column")

# access the row and column ids
rid <- ids(ds, dimension = "row")
cid <- ids(ds, dimension = "column")
```

```{r}
# update the matrix data to set some values to zero
# note that the updated matrix must be the of the same dimensions as 
# the current matrix
m[1:10, 1:10] <- 0
mat(ds) <- m

# replace row and column metadata
meta(ds, dimension = "row") <- data.frame(x=sample(letters, nrow(m),
                                                   replace=TRUE))
meta(ds, dimension = "column") <- data.frame(x=sample(letters, ncol(m),
                                                   replace=TRUE))

# replace row and column ids
ids(ds, dimension = "row") <- as.character(seq_len(nrow(m)))
ids(ds, dimension = "column") <- as.character(seq_len(ncol(m)))

# and let's look at the modified object
ds
```


## Parsing GCTX files

You can parse both GCT and GCTX files using the `parse_gctx` method. This method will read the corresponding GCT or GCTX file and return an object of class `GCT` into your R session.

### Parsing the entire file

If the file is small enough to fit in memory, you can parse the entire file at once.
```{r}
# create a variable to store the path to the GCTX file
# here we'll use a file that's internal to the cmapR package, but
# in practice this could be any valid path to a GCT or GCTX file
ds_path <- system.file("extdata", "modzs_n25x50.gctx", package="cmapR")
my_ds <- parse_gctx(ds_path)
```

You can view the structure of this newly-created GCT object just by typing its name.
```{r}
my_ds
```
As you can see from looking at the `@mat` slot, this small dataset has 50 rows and 25 columns.

### Parsing a susbset of the file

**Note:** only GCTX files (not GCT) support parsing subsets.

When working with large GCTX files, it is usually not possible to read the entire file into memory all at once. In these cases, it's helpful to read subsets of the data. These subsets can be defined by numeric row or column index, as shown below.
```{r}
# read just the first 10 columns, using numeric indices
(my_ds_10_columns <- parse_gctx(ds_path, cid=1:10))
```
As expected, we see that we've now got a 10-column dataset.

More commonly, we'll want to identify a subset of the data that is of particular interest and read only those rows and/or columns. In either case, we'll use the `rid` and/or `cid` arguments to `parse_gctx` to extract only the data we want. In this example, we'll use the GCTX file's embedded column annotations to identify the columns corresponding to the compound *vemurafenib* and then read only those columns. We can extract these annotations using the `read_gctx_meta` function.
```{r}
# read the column metadata
col_meta <- read_gctx_meta(ds_path, dim="col")

# figure out which signatures correspond to vorinostat by searching the 'pert_iname' column
idx <- which(col_meta$pert_iname=="vemurafenib")

# read only those columns from the GCTX file by using the 'cid' parameter
vemurafenib_ds <- parse_gctx(ds_path, cid=idx)
```

In the example above we used numeric column indices, but the `rid` and `cid` arguments also accept character vectors of ids. The following is equally valid.
```{r}
# get a vector of character ids, using the id column in col_meta
col_ids <- col_meta$id[idx]
vemurafenib_ds2 <- parse_gctx(ds_path, cid=col_ids)
identical(vemurafenib_ds, vemurafenib_ds2)
```

## Creating a GCT object from existing workspace objects

It's also possible to create a `GCT` object from existing objects in your R workspace. You will minimally need to have a matrix object, but can also optionally include `data.frame`s of row and column annotations. This is done using the `new` constructor function.
```{r}
# initialize a matrix object
# note that you either must assign values to the rownames and colnames
# of the matrix, or pass them,
# as the 'rid' and 'cid' arguments to GCT"
m <- matrix(stats::rnorm(100), ncol=10)
rownames(m) <- letters[1:10]
colnames(m) <- LETTERS[1:10]
(my_new_ds <- new("GCT", mat=m))
```

```{r}
# we can also include the row/column annotations as data.frames
# note these are just arbitrary annotations used to illustrate the function call
rdesc <- data.frame(id=letters[1:10], type=rep(c(1, 2), each=5))
cdesc <- data.frame(id=LETTERS[1:10], type=rep(c(3, 4), each=5))
(my_new_ds <- new("GCT", mat=m, rdesc=rdesc, cdesc=cdesc))
```


## Adding annotations to a GCT object

When working with `GCT` objects, it's often convenient to have the row and column annotations embedded as the `rdesc` and `cdesc` slots, respectively. If these metadata are stored separatly from the GCTX file itself, you can read them in separately and then apply them to the `GCT` object using the `annotate.gct` function. Let's parse the dataset again, only this time reading only the matrix. We'll then apply the column annotations we read in previously.
```{r}
# we'll use the matrix_only argument to extract just the matrix
(my_ds_no_meta <- parse_gctx(ds_path, matrix_only = TRUE))
```
Note in this case that cmapR still populates 1-column `data.frames` containing the row and column ids, but the rest of the annotations have been omitted.

Now we'll apply the column annotations using `annotate.gct`.
```{r}
# note we need to specifiy which dimension to annotate (dim)
# and which column in the annotations corresponds to the column
# ids in the matrix (keyfield)
(my_ds_no_meta <- annotate_gct(my_ds_no_meta, col_meta, dim="col",
                               keyfield="id"))
```
Note how now the `cdesc` slot is populated after annotating.


## Slicing a GCT object

Just as it's possible to read a subset of rows or columns from a GCTX file, it is also possible to extract a subset of rows or columns from a `GCT` object in memory. This is done with the `subset_gct` function. Just like `parse_gctx`, this function uses `rid` and `cid` parameters to determine which rows and columns to extract. Let's extract the *vemurafenib* columns from the `my_ds` object in memory.
```{r}
# in memory slice using the cid parameter
vemurafenib_ds3 <- subset_gct(my_ds,
                             cid=which(col_meta$pert_iname=="vemurafenib"))
identical(vemurafenib_ds, vemurafenib_ds3)
```


## Melting GCT objects

It's often useful to have data stored in long form `data.frame` objects, especially for compatibility with plotting libraries like [`ggplot2`](http://docs.ggplot2.org/current/). It's possible to convert a `GCT` object into this long form by using the `melt_gct` function, which relies on the `melt` function in the [`data.table`](https://cran.r-project.org/web/packages/data.table/) package.
```{r}
# melt to long form
vemurafenib_ds3_melted <- melt_gct(vemurafenib_ds3)
```

```{r}
# plot the matrix values grouped by gene
library(ggplot2)
ggplot(vemurafenib_ds3_melted, aes(x=pr_gene_symbol, y=value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))
```


## Merging two GCT objects

You can combine two independent `GCT` objects using the `merge_gct` function. Note that it is important to specify which dimension (row or column) you wish to merge on and that the two `GCT` objects in question share one common dimension.


## Math operations on GCT objects

The `@mat` slot of a GCT object is a base R matrix object, so it's possible to perform standard math operations on this matrix. This matrix can be accessed directly, i.e. `my_ds@mat`, but can also be extracted using the `get_gct_matrix` function from `cmapR`. Below are a few simple examples, but these can easily be extended, particulary through use of the `apply` function.

```{r}
# extract the data matrix from the my_ds object
m <- mat(my_ds)
```

Now let's perform a few simple math operations
```{r}
# compute the row and column means
row_means <- rowMeans(m)
col_means <- colMeans(m)
message("means:")
head(row_means)
head(col_means)

# using 'apply', compute the max of each row and column
row_max <- apply(m, 1, max)
col_max <- apply(m, 2, max)
message("maxes:")
head(row_max)
head(col_max)
```

### GCT-specific math functions

`cmapR` also contains a handful of math functions designed specifically for operating on `GCT` objects.
```{r}
# transposing a GCT object - also swaps row and column annotations
(my_ds_transpose <- transpose_gct(my_ds))

# converting a GCT object's matrix to ranks
# the 'dim' option controls the direction along which the ranks are calculated
my_ds_rank_by_column <- rank_gct(my_ds, dim="col")

# plot z-score vs rank for the first 25 genes (rows)
ranked_m <- mat(my_ds_rank_by_column)
plot(ranked_m[1:25, ],
     m[1:25, ],
     xlab="rank",
     ylab="differential expression score",
     main="score vs. rank")
```


## Writing GCT objects to disk
`GCT` objects can be written to disk either in GCT or GCTX format using the `write_gct` and `write_gctx` functions, respectively.
```{r}
# write 'my_ds' in both GCT and GCTX format
write_gct(my_ds, "my_ds")
write_gctx(my_ds, "my_ds")

# write.gctx can also compress the dataset upon write,
# which can be controlled using the 'compression_level' option.
# the higher the value, the greater the compression, but the
# longer the read/write time
write_gctx(my_ds, "my_ds_compressed", compression_level = 9)
```


## Converting GCT objects to SummarizedExperiment objects

The `GCT` class is quite similar in spirit to the `SummarizedExperiment` class from the `SummarizedExperiment` package (citation). Converting a `GCT` object to a `SummarizedExeriment` object is straightforward, as shown below.
```{r}
# ds is an object of class GCT
(se <- as(ds, "SummarizedExperiment"))
```


# Session Info
```{r}
sessionInfo()
```