---
title: "methylGSA: Gene Set Analysis for DNA Methylation Datasets"
author: "Xu Ren and Pei Fen Kuan"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_document:
        highlight: pygments
        toc: true
vignette: >
    %\VignetteIndexEntry{methylGSA: Gene Set Analysis for DNA Methylation Datasets}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE}
knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE)
```

## Installation

To install and load methylGSA

```{r eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("methylGSA")
```

```{r}
library(methylGSA)
```

Depending on the DNA methylation array type, other packages may be needed 
before running the analysis.    

If analyzing 450K array, `IlluminaHumanMethylation450kanno.ilmn12.hg19` needs 
to be loaded.
```{r}
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
```

If analyzing EPIC array, `IlluminaHumanMethylationEPICanno.ilm10b2.hg19` needs 
to be loaded.
```{r eval = FALSE}
library("IlluminaHumanMethylationEPICanno.ilm10b2.hg19")
```

If analyzing user-supplied mapping between CpG ID and gene name, neither
`IlluminaHumanMethylation450kanno.ilmn12.hg19` nor 
`IlluminaHumanMethylationEPICanno.ilm10b2.hg19` needs to be loaded.


## Introduction

The methylGSA package contains functions to carry out gene set analysis 
adjusting for the number of CpGs of each gene. It has been shown by 
Geeleher et al [1] that gene set analysis is extremely biased for DNA 
methylation data. This package contains three main functions to adjust for the 
bias in gene set analysis. 

* methylglm: Incorporating number of CpGs as a covariate in 
logistic regression. 

* methylRRA: Adjusting for multiple p-values of each gene by Robust 
Rank Aggregation [2], and then apply either 
over-representation analysis (ORA) or Preranked version of Gene Set Enrichment 
Analysis (GSEAPreranked) [3] in gene set testing. 

* methylgometh: Adjusting the number of CpGs for each gene by 
weighted resampling and Wallenius non-central hypergeometric approximation. 
(via missMethyl [4])

## Supported gene sets and gene ID types

* Gene Ontology (via org.Hs.eg.db [5])
* KEGG (via org.Hs.eg.db [5])
* Reactome (via reactome.db [6])
* User-supplied gene sets. Supported input gene ID types:
    + "SYMBOL"
    + "ENSEMBL"
    + "ENTREZID"
    + "REFSEQ"
    
## Supported array types

* Infinium Human Methylation 450K BeadChip 
(via IlluminaHumanMethylation450kanno.ilmn12.hg19 [7])
* Infinium Methylation EPIC BeadChip 
(via IlluminaHumanMethylationEPICanno.ilm10b2.hg19 [8])
* User supplied mapping between CpG ID and gene name

## Description of methylglm

methylglm is an extention of GOglm [9]. GOglm adjusts 
length bias for RNA-Seq data by incorporating gene length as a covariate 
in logistic regression model. methylglm adjusts length bias in DNA methylation 
by the number of CpGs instead of gene length. For each gene set, we fit a 
logistic regression model:

$$logit (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}$$
For each gene $i$, $\pi_{i}$ = Pr(gene $i$ is in gene set), $x_{i}$ represents 
negative logarithmic transform of its minimum p-value in differential 
methylation analysis, and $c_{i}$ is logarithmic transform of its number 
of CpGs. 

methylglm requires only a simple named vector. This vector contains p-values of 
each CpG. Names should be their corresponding CpG IDs. 

### Example

Here is what the input vector looks like:

```{r}
data(cpgtoy)
head(cpg.pval, 20)
```

Please note that the p-values here in `cpg.pval` is just for illustration 
purposes. They are used to illustrate how to use the functions in methylGSA. 
The actual p-values in differential methylation analysis may be quite 
different from the p-values in `cpg.pval`.    

Then call methylglm:

```{r}
res1 = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                 maxsize = 500, GS.type = "KEGG")
head(res1, 15)
```

Result is a data frame ranked by p-values of gene sets.


## Description of methylRRA

Robust rank aggregation [2] is a parameter free model that 
aggregates several ranked gene lists into a single gene list. The aggregation 
assumes random order of input lists and assign each gene a p-value based on 
order statistics. We apply this order statistics idea to adjust for number 
of CpGs. 

For gene $i$, let $P_{1}, P_{2}, ... P_{n}$ be the p-values of individual CpGs 
in differential methylation analysis. 
Under the null hypothesis, 
$P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]$. 
Let $P_{(1)}, P_{(2)}, ... P_{(n)}$ be the order statistics. Define: 
$$\rho = 
\text{min}\{\text{Pr}(P_{(1)}<P_{(1)\text{obs}}), 
\text{Pr}(P_{(2)}<P_{(2)\text{obs}})..., 
\text{Pr}(P_{(n)}<P_{(n)\text{obs}}) \} $$

methylRRA supports two approaches to adjust for number of CpGs, ORA and 
GSEAPreranked [3]. In ORA approach, for gene $i$, conversion from 
$\rho$ score into p-value is done by Bonferroni correction [2]. 
We get a p-value for each gene and these p-values are then corrected for 
multiple testing use Benjamini & Hochberg procedure 
[10]. By default, genes satisfy FDR<0.05 are considered DE genes. If there are
no DE genes under FDR 0.05, users are able to use `sig.cut` option to specify
a higher FDR cut-off or `topDE` option to declare top genes to be 
differentially expressed. We then apply ORA based on these DE genes. 

In GSEAPreranked approach, for gene $i$, we also convert $\rho$ score into 
p-value by Bonferroni correction. p-values are converted into z-scores. We then 
apply Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) on 
the gene list ranked by the z-scores.  

### Example

To apply ORA approach, we use argument `method = "ORA"` in methylRRA

```{r eval=FALSE}
res2 = methylRRA(cpg.pval = cpg.pval, method = "ORA", 
                    minsize = 200, maxsize = 210)
head(res2, 15)
```

To apply GSEAPreranked approach, we use argument `method = "GSEA"` in methylRRA

```{r eval=FALSE}
res3 = methylRRA(cpg.pval = cpg.pval, method = "GSEA", 
                    minsize = 200, maxsize = 210)
head(res3, 10)
```



## Description of methylgometh

methylgometh calls `gometh` or `gsameth` function in missMethyl 
package [4] to adjust number of CpGs in gene set 
testing. `gometh` modifies goseq method [11] by fitting a 
probability weighting function and resampling from Wallenius non-central 
hypergeometric distribution. 

methylgometh requires two inputs, `cpg.pval` and `sig.cut`. `sig.cut` specifies
the cut-off point to declare a CpG as differentially methylated. By default, 
`sig.cut` is 0.001. Similar to methylRRA, if no CpG is significant, users are 
able to specify a higher cut-off or use `topDE` option to declare 
top CpGs to be differentially methylated.

### Example

```{r eval=FALSE}
res4 = methylgometh(cpg.pval = cpg.pval, sig.cut = 0.001, 
                        minsize = 200, maxsize = 210)
head(res4, 15)
```


## Other options

methylGSA provides many other options for users to customize the analysis. 

* `array.type` is to specify which array type to use. It is either "450K" or 
"EPIC". Default is "450K". This argument will be ignored if `FullAnnot` is 
provided.
* `FullAnnot` is preprocessed mapping between CpG ID and gene name provided by
prepareAnnot function. Default is NULL. Check example below for details. 
* `group` is the type of CpG the user which to consider. If group is 
"body" ("promoter"), CpGs on gene body (promoters) will be considered. Default
is "all", which means all CpGs are considered. 
* `GS.list` is user supplied gene sets to be tested. It should be a list with 
entry names gene sets IDs and elements correpond to genes that gene sets 
contain. If there is no input list, Gene Ontology is used.
* `GS.idtype` is the type of gene ID in user supplied gene sets. If `GS.list` 
is not empty, then the user is expected to provide gene ID type. Supported ID 
types are "SYMBOL", "ENSEMBL", "ENTREZID", "REFSEQ". Default is "SYMBOL". 
* `GS.type` is the published gene sets/pathways to be tested if `GS.list` is 
empty. Supported pathways are "GO", "KEGG", and "Reactome". Default is "GO".
* `minsize` is an integer. If the number of genes in a gene set is less than 
this integer, this gene set is not tested. Default is 100.
* `maxsize` is also an integer. If the number of genes in a gene set is 
greater than this integer, this gene set is not tested. Default is 500.
* In methylRRA, `method` is to specify gene set test method. It is either 
"ORA" or "GSEA" as described above. 
* In methylglm, `parallel` is either `TRUE` or `FALSE` indicating whether 
parallel should be used.

### Examples

Here an example of user supplied gene sets. The gene ID type is gene symbol
```{r}
data(GSlisttoy)
## to make the display compact, only a proportion of each gene set is shown
head(lapply(GS.list, function(x) x[1:30]), 3)   
```

This is an example of running methylglm with parallel
```{r eval=FALSE}
library(BiocParallel)
res_p = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                  maxsize = 500, GS.type = "KEGG", parallel = TRUE)
```

methylglm and methylRRA support user supplied CpG ID to gene mapping. The 
mapping is expected to be a matrix, or a data frame or a list. For a 
matrix or data frame, 1st column should be CpG ID and 2nd column should be gene 
name. For a list, entry names should be gene names and elements correpond to 
CpG IDs. This is an example of user supplied CpG to gene mapping:
```{r}
data(CpG2Genetoy)
head(CpG2Gene)   
```

To use user supplied mapping in methylglm or methylRRA, first preprocess the
mapping by prepareAnnot function
```{r}
FullAnnot = prepareAnnot(CpG2Gene) 
```

Test the gene sets using "ORA" in methylRRA, use `FullAnnot` argument to 
provide the preprocessed CpG ID to gene mapping
```{r}
GS.list = GS.list[1:10]
res5 = methylRRA(cpg.pval = cpg.pval, FullAnnot = FullAnnot, method = "ORA", 
                    GS.list = GS.list, GS.idtype = "SYMBOL", 
                    minsize = 100, maxsize = 300)
head(res5, 10)
```

Here is another example. Test Reactome pathways using methylglm

```{r eval=FALSE}
res6 = methylglm(cpg.pval = cpg.pval, array.type = "450K", 
                    GS.type = "Reactome", minsize = 100, maxsize = 110)
head(res6, 10)
```

## Visualization

Following bar plot implemented in enrichplot [12], we also provide bar plot to 
visualize the gene set analysis results. The input of `barplot` function can 
be any result returned by methylglm, methylRRA, or methylgometh. Various 
options are provided for users to customize the plot.

* `xaxis` is to specify the label in x-axis. It is either "Count" 
(number of significant genes in gene set) or "Size" (total number of genes 
in gene set). "Count" option is not available for methylglm and 
methylRRA(GSEA). Default is "Size".
* `num` is to specify the number of genes sets to display on the bar plot.
Default is 5.
* `colorby` is a string. Either "pvalue" or "padj". Default is "padj".
* `title` is a string. The title to display on the bar plot. Default is NULL. 

### Example

Here is an example of using barplot to visualize the result of methylglm
```{r}
barplot(res1, num = 8, colorby = "pvalue")
```


## Session info

```{r sessionInfo}
sessionInfo()
```


## References
[1] Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, 
Raja Affendi Raja Ali, and Cathal Seoighe. 2013. Gene-Set Analysis Is 
Severely Biased When Applied to Genome-Wide Methylation Data.
Bioinformatics 29 (15). Oxford University Press: 1851–7.

[2] Kolde, Raivo, Sven Laur, Priit Adler, and Jaak Vilo. 2012. 
Robust Rank Aggregation for Gene List Integration and Meta-Analysis. 
Bioinformatics 28 (4). Oxford University Press: 573–80.

[3] Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, 
Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. 
Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting 
Genome-Wide Expression Profiles. Proceedings of the National Academy of 
Sciences 102 (43). National Acad Sciences: 15545–50.

[4] Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2015. 
MissMethyl: An R Package for Analyzing Data from Illumina's Humanmethylation450 
Platform. Bioinformatics 32 (2). Oxford University Press: 286–88.

[5] Carlson M (2018). org.Hs.eg.db: Genome wide annotation for Human. 
R package version 3.6.0.

[6] Ligtenberg W (2018). reactome.db: A set of annotation maps for reactome. 
R package version 1.64.0.

[7] Hansen, KD. 2016a. IlluminaHumanMethylation450kanno.ilmn12.hg19: 
Annotation for Illumina’s 450k Methylation Arrays. R Package, Version 0.6.0 1.

[8] ———. 2016b. IlluminaHumanMethylationEPICanno.ilm10b2.hg19: 
Annotation for Illumina’s Epic Methylation Arrays. R Package, Version 0.6.0 1.

[9] Mi, Gu, Yanming Di, Sarah Emerson, Jason S Cumbie, and Jeff H Chang. 2012. 
Length Bias Correction in Gene Ontology Enrichment Analysis Using 
Logistic Regression. PloS One 7 (10). Public Library of Science: e46128.

[10] Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the False 
Discovery Rate: A Practical and Powerful Approach to Multiple Testing.
Journal of the Royal Statistical Society. Series B (Methodological). 
JSTOR, 289–300.

[11] Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia 
Oshlack. 2012. Goseq: Gene Ontology Testing for Rna-Seq Datasets.
R Bioconductor.

[12] Yu G (2018). enrichplot: Visualization of Functional Enrichment Result. 
R package version 1.0.2, https://github.com/GuangchuangYu/enrichplot.