---
title: "BiocSet: Representing Element Sets in the Tidyverse"
author: 
- name: Kayla Morrell
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: BiocSet
vignette: >
  %\VignetteIndexEntry{BiocSet: Representing Element Sets in the Tidyverse}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEndcoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction
`BiocSet` is a package that represents element sets in a tibble format with the
 `BiocSet` class. Element sets are read in and converted into a tibble format. 
From here, typical `dplyr` operations can be performed on the element set. 
`BiocSet` also provides functionality for mapping different ID types and 
providing reference urls for elements and sets.

# Installation
Install the most recent version from Bioconductor:

```{r bioconductor, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocSet")
```

The development version is also available for install from GitHub:

```{r github, eval = FALSE}
BiocManager::install("Kayla-Morrell/BiocSet")
```

Then load `BiocSet`:

```{r  load, message = FALSE}
library(BiocSet)
```
# BiocSet

## Input and Output

`BiocSet` can create a `BiocSet` object using two different input methods. The 
first is to input named character vectors of element sets. `BiocSet` returns 
three tibbles, `es_element` which contains the elements, `es_set` which contains
 the sets and `es_elementset` which contains elements and sets together. 

```{r constructor}
tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
```

The second method of creating a `BiocSet` object would be to read in a `.gmt` 
file. Using `import()`, a path to a downloaded `.gmt` file is read in and a 
`BiocSet` object is returned. The example below uses a hallmark element set 
downloaded from [GSEA][], which is also included with this package. This 
`BiocSet` includes a `source` column within the `es_elementset` tibble for 
reference as to where the element set came from.

[GSEA]: http://software.broadinstitute.org/esea/index.jsp

```{r gmt}
gmtFile <- system.file(package = "BiocSet",
                        "extdata",
                        "hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2
```

`export()` allows for a `BiocSet` object to be exported into a temporary file 
with the extention `.gmt`.

```{r export, tidy = TRUE}
fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt
```

We also support importing files to create an `OBOSet` object which extends 
the `BiocSet` class. The default only displays the most basic amount of columns 
needed to perform certain tasks, but the argument `extract_tag = everything` 
allows for additional columns to be imported as well. The following workflow 
will demonstrate the use of both export and import to create a smaller subset 
of the large obo file that is accessable from GeneOntology.

```{r obo, eval = FALSE}
download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo")

foo <- import("obo_file.obo", extract_tag = "everything")

small_tst <- es_element(foo)[1,] %>%
    unnest("ancestors") %>%
    select("element", "ancestors") %>%
    unlist() %>%
    unique()

small_oboset <- foo %>% filter_elementset(element %in% small_tst)

fl <- tempfile(fileext = ".obo")
export(small_oboset, fl)
```

Then you can import this smaller obo file which we utilize here, in tests, and 
example code.
```{r small_obo}
oboFile <- system.file(package = "BiocSet",
                        "extdata",
                        "sample_go.obo")
obo <- import(oboFile)
obo
```

## Implemented functions

A feature available to `BiocSet` is the ability to activate different 
tibbles to perform certain functions on. When a `BiocSet` is created, the tibble
 `es_elementset` is automatically activated and all functions will be performed
 on this tibble. `BiocSet` adopts the use of many common `dplyr` functions such
 as `filter()`, `select()`, `mutate()`, `summarise()`, and `arrange()`. With 
each of the functions the user is able to pick a different tibble to activate
 and work on by using 'verb_tibble'. After the function is executed than the 
'active' tibble is returned back to the tibble that was active before the 
function call. Some examples are shown below of how these functions work.

```{r activate}
tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
tbl %>% filter_element(element == "a" | element == "A")
tbl %>% mutate_set(pval = rnorm(1:2))
tbl %>% arrange_elementset(desc(element))
```

## Set operations

`BiocSet` also allows for common set operations to be performed on the `BiocSet`
 object. `union()` and `intersection()` are the two set operations available in 
`BiocSet`. We demonstate how a user can find the union between two `BiocSet` 
objects or within a single `BiocSet` object. Intersection is used in the same 
way.

```{r set_ops}
# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)

# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)
```

# Case study I

Next, we demonstrate the a couple uses of `BiocSet` with an experiment dataset 
`airway` from the package `airway`. This data is from an RNA-Seq experiment on 
airway smooth muscle (ASM) cell lines.

The first step is to load the library and the necessary data.

```{r airway, message = FALSE}
library(airway)
data("airway")
se <- airway
```

The function `go_sets()` discovers the keys from the org object and uses 
`AnnotationDbi::select` to create a mapping to GO ids. `go_sets()` also allows 
the user to indicate which evidence type or ontology type they would like when 
selecting the GO ids. The default is using all evidence types and all ontology 
types.  We represent these identifieres as a `BiocSet` object. Using the 
`go_sets` function we are able to map the Ensembl ids and GO ids from the genome
 wide annotation for Human data in the `org.Hs.eg.db` package. The Ensembl ids 
are treated as elements while the GO ids are treated as sets.

```{r go_sets, message = FALSE}
library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go

# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
```

Some users may not be interested in reporting the non-descriptive elements. We 
demonstrate subsetting the `airway` data to include non-zero assays and then 
filter out the non-descriptive elements.

```{r drop_assays}
se1 = se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))
```

It may also be of interest to users to know how many elements are in each set. 
Using the `count` function we are able to calculate the elements per set.

```{r count}
go %>% group_by(set) %>% dplyr::count()
```

It may also be helpful to remove sets that are empty. Since we have shown how 
to calculate the number of elements per set, we know that this data set does 
not contain any empty sets. We decide to demonstrate regardless for those users
 that may need this functionality.

```{r empty}
drop <- es_activate(go, elementset) %>% group_by(set) %>%
    dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))
```

To simplify mapping we created a couple `map` functions. `map_unique()` is used
 when there is known 1:1 mapping, This takes four arguements, a `BiocSet` 
object, an `AnnotationDbi` object, the id to map from, and the id to map to. 
`map_multiple()` needs a fifth argument to indicate how the function should 
treat an element when there is multiple mapping. Both functions utilize 
`mapIds` from `AnnotationDbi` and return a `BiocSet` object. In the example 
below we show how to use `map_unique` to map `go`'s ids from Ensembl to gene 
symbols.

```{r map_unique}
go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
```

Another functionality of `BiocSet` is the ability to add information to the 
tibbles. Using the `GO.db` library we are able to map definitions to the GO 
ids. From there we can add the mapping to the tibble using `map_add()` and the 
mutate function.

```{r adding, message = FALSE}
library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)
```

The library `KEGGREST` is a client interface to the KEGG REST server. 
KEGG contains pathway maps that represent interaction, reaction and relation 
networks for various biological processes and diseases. `BiocSet` has a 
function that utilizes `KEGGREST` to develop a `BiocSet` object  that contains 
the elements for every pathway map in KEGG.

Due to limiations of the KEGGREST package, `kegg_sets` can take some time to 
run depending on the amount of pathways for the species of interest. Therefore 
we demonstrate using `BiocFileCache` to make the data available to the user.

```{r file_cache, message = FALSE}
library(BiocFileCache)
rname <- "kegg_hsa"
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
    kegg <- kegg_sets("hsa")
    fl <- bfcnew(rname = rname, ext = ".gmt")
    export(kegg_sets("hsa"), fl)
}
kegg <- import(bfcrpath(rname=rname))
```

Within the `kegg_sets()` function we remove pathways that do not contain any 
elements. We then mutate the element tibble using the `map_add` function to 
contain both Ensembl and Entrez ids. 

```{r kegg_filter}
map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
kegg <- kegg %>% mutate_element(ensembl = map)
```

Since we are working with ASM data we thought we would subset the `airway` data
 to contain only the elements in the asthma pathway. This filter is performed 
on the KEGG id, which for asthma is "hsa05310".

```{r subset}
asthma <- kegg %>% filter_set(set == "hsa05310")

se <- se[rownames(se) %in% es_element(asthma)$ensembl,]

se
```

The filtering can also be done for multiple pathways. 

```{r multiple}
pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)

multipaths
```

# Case study II

This example will start out the same way that Case Study I started, by loading 
in the `airway` dataset, but we will also do some reformating of the data. The 
end goal is to be able to perform a Gene Set Enrichment Analysis on the data and
 return a BiocSet object of the gene sets.

```{r airway2, message = FALSE}
data("airway")
airway$dex <- relevel(airway$dex, "untrt")
```

Similar to other analyses we perform a differential expression analysis on the 
`airway` data using the library `DESeq2` and then store the results into a 
tibble.

```{r DE}
library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
res <- results(des)

tbl <- res %>% 
    as.data.frame() %>%
    as_tibble(rownames = "ENSEMBL") 
```

Since we want to use `limma::goana()` to perform the GSEA we will need to have 
ENTREZ identifiers in the data, as well as filter out some `NA` information. 
Later on this will be considered our 'element' tibble.

```{r ENTREZ}
tbl <- tbl %>% 
    mutate(
        ENTREZID = mapIds(
            org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
        ) %>% unname()
    )

tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl
```

Now that the data is ready for GSEA we can go ahead and use `goana()` and make 
the results into a tibble. This tibble will be considered our 'set' tibble.

```{r goana}
library(limma)
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
    as.data.frame() %>%
    as_tibble(rownames = "GOALL")
go_ids
```

The last thing we need to do is create a tibble that we will consider our 
'elementset' tibble. This tibble will be a mapping of all the elements and sets.

```{r final_tibble}
foo <- AnnotationDbi::select(
    org.Hs.eg.db,
    tbl$ENTREZID,
    "GOALL",
    "ENTREZID") %>% as_tibble()
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo
```

The function `BiocSet_from_elementset()` allows for users to create a BiocSet 
object from tibbles. This function is helpful when there is metadata contained 
in the tibble that should be in the BiocSet object. For this function to work 
properly, the columns that are being joined on need to be named correctly. For 
instance, in order to use this function on the tibbles we created we need to 
change the column in the 'element' tibble to 'element', the column in the 'set' 
tibble to 'set' and the same will be for the 'elementset' tibble. We demonstrate
 this below and then create the BiocSet object with the simple function.

```{r BiocSet_from_elementset}
foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
es
```

For those users that may need to put this metadata filled BiocSet object back 
into an object similar to GRanges or SummarizedExperiment, we have created 
functions that allow for the BiocSet object to be created into a tibble or 
data.frame.

```{r tibble_or_data.frame}
tibble_from_element(es)

head(data.frame_from_elementset(es))
```

# Information look up

A final feature to `BiocSet` is the ability to add reference information about 
all of the elements/sets. A user could utilize the function `url_ref()` to add 
information to the `BiocSet` object. If a user has a question about a specific 
id then they can follow the reference url to get more informtion. Below is an 
example of using `url_ref()` to add reference urls to the `go` data set we 
worked with above.

```{r url}
url_ref(go)
```

# Session info

```{r}
sessionInfo()
```