--- title: "methylGSA: Gene Set Analysis for DNA Methylation Datasets" author: "Xu Ren and Pei Fen Kuan" date: "`r Sys.Date()`" output: rmarkdown::html_document: highlight: pygments toc: true vignette: > %\VignetteIndexEntry{methylGSA: Gene Set Analysis for DNA Methylation Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE) ``` ## Installation To install and load methylGSA ```{r eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("methylGSA") ``` ```{r} library(methylGSA) ``` Depending on the DNA methylation array type, other packages may be needed before running the analysis. If analyzing 450K array, `IlluminaHumanMethylation450kanno.ilmn12.hg19` needs to be loaded. ```{r} library(IlluminaHumanMethylation450kanno.ilmn12.hg19) ``` If analyzing EPIC array, `IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs to be loaded. ```{r eval = FALSE} library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) ``` If analyzing user-supplied mapping between CpG ID and gene name, neither `IlluminaHumanMethylation450kanno.ilmn12.hg19` nor `IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs to be loaded. ## Introduction The methylGSA package contains functions to carry out gene set analysis adjusting for the number of CpGs of each gene. It has been shown by Geeleher et al [1] that gene set analysis is extremely biased for DNA methylation data. This package contains three main functions to adjust for the bias in gene set analysis. * methylglm: Incorporating number of CpGs as a covariate in logistic regression. * methylRRA: Adjusting for multiple p-values of each gene by Robust Rank Aggregation [2], and then apply either over-representation analysis (ORA) or Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) [3] in gene set testing. * methylgometh: Adjusting the number of CpGs for each gene by weighted resampling and Wallenius non-central hypergeometric approximation. (via missMethyl [4]) ## Supported gene sets and gene ID types * Gene Ontology (via org.Hs.eg.db [5]) * KEGG (via org.Hs.eg.db [5]) * Reactome (via reactome.db [6]) * User-supplied gene sets. Supported input gene ID types: + "SYMBOL" + "ENSEMBL" + "ENTREZID" + "REFSEQ" ## Supported array types * Infinium Human Methylation 450K BeadChip (via IlluminaHumanMethylation450kanno.ilmn12.hg19 [7]) * Infinium Methylation EPIC BeadChip (via IlluminaHumanMethylationEPICanno.ilm10b4.hg19 [8]) * User supplied mapping between CpG ID and gene name ## Description of methylglm methylglm is an extention of GOglm [9]. GOglm adjusts length bias for RNA-Seq data by incorporating gene length as a covariate in logistic regression model. methylglm adjusts length bias in DNA methylation by the number of CpGs instead of gene length. For each gene set, we fit a logistic regression model: $$logit (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}$$ For each gene $i$, $\pi_{i}$ = Pr(gene $i$ is in gene set), $x_{i}$ represents negative logarithmic transform of its minimum p-value in differential methylation analysis, and $c_{i}$ is logarithmic transform of its number of CpGs. methylglm requires only a simple named vector. This vector contains p-values of each CpG. Names should be their corresponding CpG IDs. ### Example Here is what the input vector looks like: ```{r} data(cpgtoy) head(cpg.pval, 20) ``` Please note that the p-values here in `cpg.pval` is just for illustration purposes. They are used to illustrate how to use the functions in methylGSA. The actual p-values in differential methylation analysis may be quite different from the p-values in `cpg.pval` in terms of the magnitude. Then call methylglm: ```{r} res1 = methylglm(cpg.pval = cpg.pval, minsize = 200, maxsize = 500, GS.type = "KEGG") head(res1, 15) ``` Result is a data frame ranked by p-values of gene sets. ## Description of methylRRA Robust rank aggregation [2] is a parameter free model that aggregates several ranked gene lists into a single gene list. The aggregation assumes random order of input lists and assign each gene a p-value based on order statistics. We apply this order statistics idea to adjust for number of CpGs. For gene $i$, let $P_{1}, P_{2}, ... P_{n}$ be the p-values of individual CpGs in differential methylation analysis. Under the null hypothesis, $P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]$. Let $P_{(1)}, P_{(2)}, ... P_{(n)}$ be the order statistics. Define: $$\rho = \text{min}\{\text{Pr}(P_{(1)}