<!--
%\VignetteEngine{knitr}
%\VignetteIndexEntry{3. Automatic Functional Enrichment on Signature Genes}
-->

Automatic Functional Enrichment on Signature Genes
=============================================================

**Author**: Zuguang Gu ( z.gu@dkfz.de )

**Date**: `r Sys.Date()`

**Package version**: `r installed.packages()["cola", "Version"]`

-------------------------------------------------------------

```{r, echo = FALSE, message = FALSE}
library(markdown)
options(markdown.HTML.options = c("use_xhtml", "smartypants", "base64_images", "mathjax", "highlight_code"))
options(markdown.HTML.stylesheet = system.file("resources", "markdown.css", package = "markdown"))
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center")
options(width = 100)
library(cola)
```

If the matrix rows can correspond to genes (e.g. the gene expression matrix,
or the methylation array data where CpG sites can be annotated to the
transcription start site of genes), _cola_ performs functional enrichment by
the `functional_enrichment()` function to the signatures by
[*ClusterProfiler*](https://bioconductor.org/packages/clusterProfiler/),
[*DOSE*](https://bioconductor.org/packages/DOSE/) or
[*ReactomePA*](https://bioconductor.org/packages/ReactomePA/) packages.


We first demonstrate the usage of `functional_enrichment()` function by the
TCGA GBM dataset. In following example code, `TCGA_GBM_subgroup.rds` is
generated by the code demonstrated
[here](https://jokergoo.github.io/cola_examples/TCGA_GBM/). We download the
result file that has already been generated.

```{r, results = "hide"}
download.file("https://jokergoo.github.io/cola_examples/TCGA_GBM/TCGA_GBM_subgroup.rds", 
    destfile = "TCGA_GBM_subgroup.rds", quiet = TRUE)
rl = readRDS("TCGA_GBM_subgroup.rds")
file.remove("TCGA_GBM_subgroup.rds")
```

We select result from a single method `ATC:skmeans`:

```{r}
library(cola)
res = rl["ATC:skmeans"]
res
```

We check how the signature genes looks like under 4-group classification:

```{r}
set.seed(123)
get_signatures(res, k = 4)
```

Rows are split into four groups with different expression patterns among
samples. The functional enrichment will be applied to genes in each
row-cluster.

To apply functional enrichment, the important thing is to check the gene ID
type in the input matrix. The helper function `rownames()` directly returns
the row names of the matrix stored in `res`.

```{r}
head(rownames(res))
```

The gene ID is symbol. For all enrichment analysis provided by
*ClusterProfiler*, *DOSE* or *ReactomePA*, the core ID type is Entrez
ID, thus we need to convert from symbol to Entrez ID.

To make it easy, *cola* automatically tests the gene IDs types and it
automatically recognizes three ID types of Ensembl ID, RefSeq ID and gene
symbol, which covers most cases of the analysis. If user's gene ID type is one
of the three supported ones, simply run `functional_enrichment()` on `res`
only with specifying the number of subgroups.

```{r, eval = FALSE}
lt = functional_enrichment(res, k = 4)
```

```{r, echo = FALSE}
if(file.exists("lt_functional_enrichment_TCGA_GBM.rds")) {
	lt = readRDS("lt_functional_enrichment_TCGA_GBM.rds")
} else {
	lt = functional_enrichment(res, k = 4)
	saveRDS(lt, "lt_functional_enrichment_TCGA_GBM.rds", compress = "xz")
}
```

By default, `functional_enrichment()` runs enrichment on Gene Ontology,
biological function ontologies. `ontology` can be set as follows:

- `BP`/`MF`/`CC`, `org_db` argument should be set to [the corresponding
  database](http://bioconductor.org/packages/release/BiocViews.html#___Organism),
  such as `"org.Hs.eg.db"`,
- `KEGG`, `organism` argument should be set to corresponding species
  abbreviation, such as `"hsa"`,
- `DO`, only works for human,
- `MSigDb`, only works for human, the path of gmt file should be specified by
  `gmt_file` argument. You should only use the gmt files where genes are annotated with the Entrez IDs.
- `Reactome`, `organism` argument should be set to the corresponding species,
  such as `"human"`.

`ontology` can be set as a vector of multiple ontologies.

The value of `lt` is a list of data frames for different ontologies combined
with different k-means groups. Since k-means clustering has already been
applied in previous `get_signatures()`, the k-means clustering result is
stored in `res` object and `functional_enrichment()` directly uses the
classification from it.

```{r}
names(lt)
head(lt[[1]])
```

If the gene ID type is not any of Ensembl ID, RefSeq ID or gene symbol, user
needs to provide a named vector which provides mapping between user's ID types
to Entrez IDs.

In following example we demonstrate how to properly set the ID mapping by [the
Golub leukemia
dataset](https://jokergoo.github.io/cola_examples/Golub_leukemia/). The result
file is already generated.

```{r, results = "hide"}
download.file("https://jokergoo.github.io/cola_examples/Golub_leukemia/Golub_leukemia_subgroup.rds", 
    destfile = "Golub_leukemia_subgroup.rds", quiet = TRUE)
rl = readRDS("Golub_leukemia_subgroup.rds")
file.remove("Golub_leukemia_subgroup.rds") 
```

To simplify, we only take result from one method:

```{r}
res = rl["ATC:skmeans"]
head(rownames(res))
set.seed(123)
get_signatures(res, k = 3)
```

The Golub leukemia dataset is a microarray dataset where the gene ID is the
probe ID. Thankfully, there is already an annotation package from Bioconductor
([hu6800.db](https://bioconductor.org/packages/hu6800.db/)) that provides
mapping between the probe ID to Entrez ID.

```{r}
library(hu6800.db)
x = hu6800ENTREZID
mapped_probes = mappedkeys(x)
id_mapping = unlist(as.list(x[mapped_probes]))
head(id_mapping)
```

Proportion of probe IDs that can be mapped:

```{r}
sum(!is.na(id_mapping[rownames(res)]))/nrow(res)
```

As you see, the format of `id_mapping` is simple. Names of the vector are the
probe IDs and the values are the Entrez IDs. We can directly assign the ID
mapping variable to `id_mapping` argument.

```{r, eval = FALSE}
lt = functional_enrichment(res, k = 3, id_mapping = id_mapping)
```

`functional_enrichment()` can also be applied to two other classes of objects:

- The `ConsensusPartitionList` object which is generated by `run_all_partition_methods()` function. The result is a list (for each method) of lists (for each ontology) of data frames. 
- Simply a vector of gene IDs. The result is a data frame.


```{r}
sessionInfo()
```