---
title: "FoldGO: a tool for fold-change-specific functional enrichment analysis"
author: "Daniil Wiebe"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{FoldGO: a tool for fold-change-specific functional enrichment analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

devtools::load_all(".")
library(kableExtra)
library(topGO)
```

A typical scenario of transcriptome data analysis is identification of differentially expressed genes (DEGs), those with significant changes in the number of transcripts (fold-change), and functional enrichment analysis of DEGs lists using Gene Ontology. Classical gene set enrichment analysis ignores the difference in the fold-changes that lead to the loss of valuable information. The FoldGO method has been created to identify the GO terms significantly overrepresented for the genes responded to the factor within a narrow fold-change-interval. FoldGO processes the DEGs list in three steps:

* At the first step, FoldGO sorts the genes by their fold-change values, subdivides them into gene sets of equal size (quantiles) and generates the gene sets for all combinations of the neighboring quantiles, including the whole lists of up-regulated or down-regulated genes.
* At the second step, FoldGO generates the list of GO terms annotated to at least one differentially expressed gene (DEG). Optionally, GO terms significantly enriched in the DEGs annotation can be selected at this step.
* At the third step, FoldGO performs enrichment analysis for all selected GO terms and for all gene sublists. FoldGO measures the bias in the portion of the genes responded fold-change-specifically versus whole DEGs for the genes associated with a GO term or the background. If the result of the test is significant, such GO terms are considered as fold-change-specific.

See an example of the FoldGO performance in the analysis of the transcriptome data on expression changes of Arabidopsis thaliana genes in response to plant hormone auxin treatment (Omelyanchuk et al., 2017)^[Omelyanchuk, N. A. et al. Auxin regulates functional gene groups in a fold-change-specific manner in *Arabidopsis thaliana* roots // Nat Sci Rep – 2017. – N 7(1) – p 2489].

## Workflow

FoldGO pipeline consists of three steps:

* gene sublists generation;
* GO terms preselection (optional);
* fold-change-specific enrichment analysis.

As input data the algorithm uses the tables for up- and down-regulated genes that contain Gene IDs and their fold-change values:

| GeneID | fold-change |
|:------|:----:|
|AT3G65420|3.6|
|AT1G78450|1.5|
|AT2G66890|2.1|
|...|

First, one has to separate initial set of genes in to quantiles and generate unions of all neighbouring quantiles. For example, we will use built-in data derived from experiment on auxin treatment of *Arabidopsis thaliana* roots. GeneGroups function will take only first two columns, so be sure that your data have gene identifiers in the first column and fold-change values (FC) in the second one. Note that data for up- and down-regulation must be processed separately. In the following example demonstrating GeneGroups function usage only the data for up-regulation is used.
```{r, eval=FALSE}
head(degenes)
```
```{r, warning = FALSE, message = FALSE, echo=FALSE}
cut_degenes <- head(degenes)
cut_degenes[, c(3, 4)] <- sapply(c(3, 4), function(x) formatC(as.numeric(cut_degenes[, x]), digits = 2,  format="e"))
knitr::kable(cut_degenes, row.names = FALSE)
```
```{r groups, warning = FALSE, message = FALSE}
up_groups <- GeneGroups(degenes, quannumber=6)
```

At the next step we will conduct functional enrichment analysis of generated groups. For functional enrichment analysis FoldGO uses TopGO package.

### Functional annotation

#### Custom annotaion

For custom annotation one has to provide GO id -> Gene id list or object of `MgsaSets` (from `mgsa` package) or `GAFReader` class. 
FoldGO provides GAFReader - simple and convinient parser for annotation presented in GAF format. It takes only two arguments: 

* file - full path to GAF file.
* geneid_col - index of column with Genes IDs

NOTE: The GAF file in example is truncated. The original full file can be downloaded from GO consortium website: <http://www.geneontology.org/page/download-go-annotations>.

```{r gafread, warning = FALSE, message = FALSE}
gaf_path <- system.file("extdata", "gene_association.tair.lzma", package = "FoldGO")
gaf <- GAFReader(file = gaf_path,  geneid_col = 10)
```

One can retrieve direct annotations as list object and version of GAF file using following methods:

```{r gaf_getters chunk 1, warning = FALSE, message = FALSE}
getVersion(gaf)
```

```{r gaf_getters chunk 2, warning = FALSE, message = FALSE, results='hide'}
getAnnotation(gaf)
```

To annotate our gene groups we will use FuncAnnotGroupsTopGO function which uses topGO package for singular enrichment analysis. The minimal set of arguments needed for this function to work is:

* groups - object of GeneGroups class 
* namespace - character string specifing GO namespace. This argument accepts the following values: "BP", "MF" or "CC", where 
    * BP - biological process
    * MF - molecular function
    * CC - cellular component
* customAnnot - It can be an object generated by \code{\link{GAFReader}} or \code{readGAF} function from mgsa package or list which 
has GO term ids as keys and character vectors contain gene ids as values.
* annot - functions used to compile a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term. Here it must be assigned with topGO::annFUN.GO2genes
* bggenes - vector contains background set of genes

```{r annot, eval = FALSE}
up_annotobj <- FuncAnnotGroupsTopGO(genegroups = up_groups, namespace = "MF", customAnnot = gaf, annot = topGO::annFUN.GO2genes, bggenes = bggenes)
```

#### Using annotaions from Bioconductor packages

Another possibility is to use bioconductor packages containing annotations for specific organism. For example "org.Hs.eg.db" (Human) and "org.At.tair.db" (Arabidopsis), package name must be assigned to `mapping` argument. In this case one has to assign `topGO::annFUN.org` to `annot` argument and specify `ID` (from topGO package manual: character string specifing the gene identifier to use). Currently only the following identifiers can be used: `c("entrez", "genbank", "alias", "ensembl", "symbol", "genename", "unigene")`. 

Enrichment analysis with Arabidopsis annotations:

```{r ara_annot, eval = FALSE}
up_annotobj <- FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.At.tair.db", annot = topGO::annFUN.org, ID = "entrez", bggenes = bggenes)
```

Enrichment analysis with annotations for Human:

```{r human_annot, warning = FALSE, message = FALSE, results='hide'}
up_groups <- GeneGroups(degenes_hum, quannumber=6)
FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.Hs.eg.db", annot = topGO::annFUN.org, ID = "ensembl", bggenes = bggenes_hum)
```

### Testing for fold-specificity

The fold-specificity recognition procedure consists of GO terms preselection from DEGs annotation and fold-change-specific enrichment analysis. At each step the FDR threshold must be established. By default FDR threshold for GO terms preselection (fdrstep1) is set to 1 (no preselection) and FDR threshold for fold-change-specific enrichment analysis (fdrstep2) is set to 0.05. As a default method for mutltiple testing correction FoldGO uses Benjamini-Hochberg correction procedure^[Benjamini, Y., Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society - 1995 - Series B. 57 (1): 289–300]. 

```{r fs_test, warning = FALSE, message = FALSE,}
up_fsobj <- FoldSpecTest(up_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)
down_fsobj <- FoldSpecTest(down_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)
```

It is possible to choose another correction procedure from R base which can be listed via `p.adjust.methods`. Here Benjamini-Yekutieli correction procedure^[ Benjamini, Y., Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics - 2001 - 29 (4): 1165–1188. doi:10.1214/aos/1013699998] is selected.

```{r fs_test_padj, eval=FALSE}
FoldSpecTest(up_annotobj, padjmethod = "BY")
```

One can inspect the results of enrichment analysis as dataframes. Access dataframe with fold-specific terms

```{r fs_table}
fs_table <- getFStable(up_fsobj)
```
```{r, eval=FALSE}
head(fs_table)
```
```{r, echo=FALSE}
library(kableExtra)
fs_table[, c(4, 5, 6, 7)] <- sapply(c(4, 5, 6, 7), function(x) formatC(as.numeric(fs_table[, x]), digits = 2,  format="e"))
knitr::kable(head(fs_table)) %>% kable_styling(font_size = 12)
```

where:

* ids - GO term identifier
* namespace - GO term namespace
* name - GO term full name
* wholeint_pval - p-value for specific GO term derived from annotation for all differentially expressed genes set
* wholeint_padj - q-value for specific GO term derived from annotation for all differentially expressed genes set
* min_pval - minimal p-value for specific GO term across all intervals
* padj - adjusted minimal p-value for specific GO term across all intervals 
* interval - interval that corresponds to minimal p-value for specific GO term

And with not fold-specific:
```{r nfs_table}
nfs_table <- getNFStable(up_fsobj)
```
```{r, eval=FALSE}
head(nfs_table)
```
```{r, echo=FALSE}
nfs_table[, c(4, 5, 6, 7)] <- sapply(c(4, 5, 6, 7), function(x) formatC(as.numeric(nfs_table[, x]), digits = 2,  format="e"))
knitr::kable(head(nfs_table)) %>% kable_styling(font_size = 12)
```

### Plot results 

Via `plot` function one can plot “Fold-change specific GO Profile” on which the GO terms significantly associated with a certain fold-change intervals are presented in yellow and blue boxes for up- and down-regulated genes, correspondingly. Here the result for six equal in size fold-change intervals is presented. The diagram presents only fold-change-specific terms. If the gene was associated fold-specifically with down-regulation but not fold-specifically with up-regulation (or vise versa) than not fold-change-specific interval (1-6 here) will be also shown.

```{r fs_plot, warning = FALSE, message = FALSE, fig.height = 10, fig.width = 7}
plot(up_fsobj, down_fsobj)
```