---
title: "Introduction to MultiDataSet"
subtitle: "Carles Hernandez-Ferrer, Carlos Ruiz-Arenas, Juan R. González"
author: |
  | Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain
  | Bioinformatics Research Group in Epidemiology             
  | (<http://www.creal.cat/brge.htm>)
date: "`r Sys.Date()`"
package: "`r pkg_ver('MultiDataSet')`"
output: 
  BiocStyle::html_document:
    number_sections: true
    toc: yes
    fig_caption: yes
    fig_height: 3
    fig_width: 4
vignette: >
  %\VignetteIndexEntry{Introduction to MultiDataSet}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
# Introduction

`MultiDataSet` is a new class designed to manage different omic datasets with samples in common. The main purpose when developing `MultiDataSet` was to ease the integration of multiple biological datasets. 

`MultiDataSet` is based in Bioconductor's framework and it can work with S4 classes derived from `eSet` or from `SummarizedExperiment`. The following data is extracted from each set that is added to a `MultiDataSet`:

* The matrix or matrices of data
* The phenotypic data
* The feature data
* A function to recover the original set

It should be taken into account that phenotypic data is stored independently for each set. This allows storing variables with the same name but different values in different sets (e.g. technical variables). Another fact is that feature data is stored in the form of an `AnnotatedDataFrame` and `GenomicRanges`. This design allows to quickly perform subsets using `GenomicRanges` while preserving the possibility of storing sets that do not have genomic coordinates (e.g. metabolic or exposure data).  

In this document, addition of sets and subsetting will be presented. Advanced features such as creating new functions to add sets or developing integration functions using `MultiDataSet` are covered in other documents. 
 
In the code below, the libraries needed in this tutorial are loaded:

```{r Load libraries, message=FALSE, warning = FALSE}
library(MultiDataSet)
library(MEALData)
library(minfiData)
library(GenomicRanges)
``` 

## Create a new MultiDataSet

`MultiDataSet` objects should be created prior to adding any object using the constructor:

```{r New Multi}
multi <- createMultiDataSet()
multi
``` 

The function `names` recovers the names of sets included in the `MultiDataSet`. Right now, the object is empty so there are no names:

```{r Names empty Multi}
names(multi)
length(names(multi))
``` 

# Adding sets 

Sets can be added to `MultiDataSet` using two classes of functions: general and specific. General functions add any `eSet` or `SummarizedExperiment` while specific functions add more specific objects (e.g. `ExpressionSet`).

## General functions

General functions directly interact with the `MultiDataSet` and change its content. They only check if the incoming set is an `eSet` or a `SummarizedExperiment`. Due to their flexibility, they are thought to be used by developers to create specific functions. 

`MultiDataSet` contains two general functions: `add_eset` and `add_rse`. They work similarly but they are adapted to the particularities of `eSet` and `SummarizedExperiment`. Therefore, their common features will only be covered in the add eSet section. 

### Add eSet

`add_eset` is the general function to add eSet-derived classes. This function has three important arguments. `object` is the `MultiDataSet` where the set will be added, `set` is the `eSet` that will be added and `dataset.type` is the type of the new set. The next lines will illustrate its use by adding an `ExpressionSet` from `r Rpackage("MEALData")`:

```{r add_eset 10}
data(eset)
eset
multi2 <- add_eset(multi, eset, dataset.type = "expression")
multi2
multi
``` 

The print of multi2 shows the names of the sets that has been added and, for each set, the number of features and samples and if it has a rowRanges.  It should be noticed that `add_eset` **does not modify** the `MultiDataSet` passed in the `object` argument. Consequently, multi is still empty. This property is common of all the functions used to add sets to the `MultiDataSet`. 

By default, the name of the incoming set is equal to dataset.type. If we want to add another set of the same type, we can use the argument `dataset.name` to differentiate them. As an example, we will add the same `ExpressionSet` of the previous example but with another name:

```{r add_eset 2}
multi2 <- add_eset(multi2, eset, dataset.type = "expression", dataset.name = "new")
multi2
``` 

If `dataset.name` is used, the resulting name is "dataset.type+dataset.name". With this strategy, we can have different datasets of the same type and we can still retrieve those datasets corresponding to the same type of data. 

In order to assure sample consistency across the difference datasets, we use a common sample identifier  across all datasets. Sample identifier should be introduced by adding a column called `id` in the phenotypic data (`phenoData`) of the object. If it is not already present, it is created by default using the sample names of the given set. 

Because our `ExpressionSet` does not contain this column, a warning is raised. To solve it, we can manually add an `id` to our dataset: 

```{r add_eset add id to eset}
eset2 <- eset
eset2$id <- 1:61
multi2 <- add_eset(multi, eset2, dataset.type = "expression")
multi2
``` 

There are three additional arguments: `warnings`, `overwrite` and `GRanges`. `warnings` can be used to enable or disable the warnings. `overwrite` is used when we want to add a set with a name that is currently used. If TRUE, the set is substituted. If FALSE, nothing is changed:

```{r add_eset overwrite, error=TRUE}
eset2 <- eset[, 1:10]
multi2 <- add_eset(multi, eset, dataset.type = "expression", warnings = FALSE)
multi2
multi2 <- add_eset(multi2, eset2, dataset.type = "expression", warnings = FALSE, overwrite = FALSE)
multi2
multi2 <- add_eset(multi2, eset2, dataset.type = "expression", warnings = FALSE, overwrite = TRUE)
multi2
``` 

Finally, `GRanges` argument is used add a `GenomicRanges` with the annotation. By default, a `GenomicRanges` will be generated from the set's `fData`. With this parameter, we can directly supply a `GenomicRanges` or, if the annotation of our dataset cannot be transformed to a `GenomicRanges` (e.g. proteomic data), we can set this parameter to NA:

```{r add_eset GRanges}
multi2 <- add_eset(multi, eset, dataset.type = "expression", warnings = FALSE, GRanges = NA)
multi2
``` 

Now, we can see that rowRanges is NO for expression. The implications will be described in the _Filtering by `GenomicRanges`_ section.  

### Add SummarizedExperiment

`SummarizedExperiment`s are added using `add_rse`. Its arguments and behavior are very similar to those of `add_eset`. The only difference is that there is `GRanges` argument (annotation data is already in the form of a `GenomicRanges`). To exemplify its use, a `GenomicRatioSet` (a `r Rpackage("minfi")` class) will be created and added:

```{r add_rse overwrite}
data("MsetEx")
MsetEx2 <- MsetEx[1:100, ] ### Subset the original set to speed up
GRSet <- mapToGenome(ratioConvert(MsetEx2)) ## Convert MethylSet (eSet-derived) to GenomicRatioSet
GRSet
multi <- createMultiDataSet()
multi2 <- add_rse(multi, GRSet, dataset.type = "methylation", warnings = FALSE)
multi2
``` 

## Specific functions

Specific functions are designed to add specific datasets to `MultiDataSet`. They call general functions to add the data and they usually perform several checks (e.g: checking the class of the set or checking fData's columns). As a result, only sets with some features can be introduced to `MultiDataSet` and no later checks on data structure are required. Specific functions should always be used by users to ensure that the sets are properly added to `MultiDataSet`. 

In `MultiDataSet` we have introduced four specific functions: `add_genexp`, `add_rnaseq`, `add_methy` and `add_snps`. All these functions has two arguments: `object` with the `MultiDataSet` and a second argument with the incoming set. The name of the second argument depends on the specific function (e.g: gexpSet for `add_genexp`, snpSet for `add_snps`...). Despite we will only show examples of `add_genexp` and `add_snps`, the other specific functions share the same behavior and features. 

`add_genexp` adds an `ExpressionSet` to the slot "expression". We will use the `ExpressionSet` of `r Rpackage("MEALData")` as example:

```{r add genexep error, error = TRUE}
multi <- createMultiDataSet()
multi2 <- add_genexp(multi, eset)
eset
``` 

`add_genexp` requires that the fData of the incoming `ExpressionSet` has the columns `chromosome`, `start` and `end`. Our `eset` object has the columns `start` and `end` but the chromosome is labeled with `chr`. We should fix it before continue:

```{r add genexep 0}
fvarLabels(eset)[1] <- "chromosome"
multi <- createMultiDataSet()
multi2 <- add_genexp(multi, eset)
multi2
``` 


Given that `add_genexp` calls `add_eset`, the arguments of `add_eset` can also be used but `dataset.type` and `GRanges`. Let's add the same `ExpressionSet` to another slot using `dataset.name`:

```{r add genexp }
multi2 <- add_genexp(multi2, eset, dataset.name = "2")
multi2
``` 

`add_snps` adds a `SnpSet` to the slot "snps" of a `MultiDataSet`. Snps data is in the form of `SnpMatrix` in `r Rpackage("MEALData")`, so we need to convert it first to `SnpSet`:

```{r create snpset }
data(snps)
SnpSet <- new("SnpSet", call = snps$genotypes)
fData(SnpSet) <- snps$map
SnpSet
```

If we try now to add this set to a `MultiDataSet`:

```{r add snpset1, error = TRUE}
multi2 <- add_snps(multi, SnpSet)
```

This error rises when `makeGRangesFromDataFrame` is unable to create a `GenomicRanges` from the `fData`. If we take a look to `SnpSet`'s `fData`:

```{r snpset1 fData}
head(fData(SnpSet))
```

There are two columns that can be used as chromosome. To simplify it, we will remove the first column of `fData` and we will try again:

```{r add snpset2}
fData(SnpSet) <- fData(SnpSet)[, -1]
multi2 <- add_snps(multi, SnpSet)
multi2
```

Now, the method has worked and the set has been successfully added. This case exemplifies a checking in the structure of the `fData` of the incoming set by a specific function. 

# Subsetting

Subsetting of `MultiDataSet`s can be done by samples, by tables or using a `GenomicRanges`. In order to illustrate these operations, we will use the `ExpressionSet` and the methylation data of `r Rpackage("MEALData")`. These sets contain data from common samples. First, we will create a `MethylationSet` from the matrix of beta values and the phenotypes and then we add the sets to a `MultiDataSet`. The `ExpressionSet` will also be added to another slot but setting GRanges = NA:

```{r Subsetting intro }
mset <- prepareMethylationSet(betavals, pheno)
multi <- createMultiDataSet()
multi <- add_methy(multi, mset)
multi <- add_genexp(multi, eset)
multi <- add_eset(multi, eset, dataset.type = "test", GRanges = NA)
multi
```

The expression data contains 61 samples and 21916 features and the methylation data 61 samples and 451383 CpGs. 

## Samples 

Subsetting by samples can be done in two different ways. The first option is to introduce a vector of sample ids. `MultiDataSet` has the operator `[` overloaded and samples are the first element:

```{r subset samples}
samples <- sampleNames(eset)[1:25]
multi[samples, ]
```

Samples' subsetting returns, for each set, all the samples that are present in the filtering vector. In our example, we selected the first 25 samples of the `ExpressionSet`. In the `MethylationSet`, only 23 of these samples were present. 

We can also select only those samples that are present in all the datasets with the function `commonSamples`. This method returns a new `MultiDataSet` but only with the common samples:  

```{r subset common samples}
commonSamples(multi)
length(intersect(sampleNames(eset), sampleNames(mset)))
```

The resulting `MultiDataSet` contains 57 samples for expression and methylation, the same that the intersection between the sample names of the original sets. 

## Tables

We can select the datasets of a `MultiDataSet` using their names. They should be placed in the second position of `[`: 

```{r subset tables}
multi[, "expression"]
multi[, c("methylation", "test")]
```

If we want to retrieve the original object, we can set `drop = TRUE` or use the `[[` operator:

```{r select tables}
multi[["expression"]]
multi[, "expression", drop = TRUE]
```

## GenomicRanges

Finally, `MultiDataSet` can be filtered by `GenomicRanges`. In this case, only those features inside the range will be returned and those datasets without `GenomicRanges` data will be discarded. The GenomicRanges should be placed in the third position of `[`: 

```{r Genomic Ranges}
range <- GRanges("chr17:1-100000")
multi[, , range]
```

As a consequence of filtering by `GenomicRanges`, the set "test" that did not have rowRanges have been discarded. If the `GenomicRanges` contains more than one range, features present in any of the ranges are selected:

```{r Genomic Ranges 2}
range2 <- GRanges(c("chr17:1-100000", "chr17:1000000-2000000"))
multi[, , range2]
```


## Multiple subsetting

These three operations can be combined to apply the three filters. In this case, first the sets are selected, then the samples and lastly the features:

```{r combined}
multi[samples, "expression", range]
multi[samples, "methylation", range, drop = TRUE]
```

## Advanced subsetting

The base R function `subset` can be used to perform advanced subsetting. This function can be used to filter the features by a column each dataset feature data. For instance, we can use this function to select all the features associated to a gene:

```{r Advanced genes}
subset(multi, genes == "SLC35E2")
```

This line returns a `MultiDataSet` with the features associated to the gene SLC35E2. The expression uses `genes` because it is a column that is common to the datasets include in `multi`. This function accepts any expression that returns a logical. Therefore, we can also use the `%in%` operator or include more than one expression:

```{r Advanced genes 2}
subset(multi, genes %in% c("SLC35E2", "IPO13", "TRPV1"))
subset(multi, genes == "EEF1A1" | genes == "LPP")
```

A similar approach can be used for selection samples with a common phenotype. In this case, we should pass the expression in the third argument and the column must also be present in the phenodata of the datasets:


```{r Advanced pheno}
subset(multi, , gender == "female")
```

With this line of code, we can select all the women of the study. Both subsetting can be applied at the same time:

```{r Combined advanced}
subset(multi, genes == "SLC35E2", gender == "female")
```