% \VignetteEngine{knitr}
% \VignetteIndexEntry{DOSE - an R package for Disease Ontology Semantic and Enrichment}
% \VignettePackage{DOSE}

% To compile this document, run the commands within R
% require(knitr); knit2pdf("DOSE.Rnw")

\documentclass[12pt]{article}
\usepackage[a4paper,left=3cm,top=2cm,bottom=3cm,right=3cm,ignoreheadfoot]{geometry}
\pagestyle{empty}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{helvet}
\usepackage[titletoc]{appendix}
\usepackage{tocloft}

\setlength{\parindent}{0em}
\setlength{\parskip}{.5em}

\renewcommand{\familydefault}{\sfdefault}

\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\small\texttt{#1}}}
\newcommand{\Rfunction}[1]{\Robject{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rparameter}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\R}{\textsf{R}}

\usepackage[
  bookmarks,%
  colorlinks,%
  linktoc=section,%
  linkcolor=RedViolet,%
  citecolor=RedViolet,%
  urlcolor=RedViolet,%
  linkbordercolor={1 1 1},%
  citebordercolor={1 1 1},%
  urlbordercolor={1 1 1},%
  raiselinks,%
  plainpages,%
  pdftex
  ]{hyperref}


\usepackage{cite}
\renewcommand{\floatpagefraction}{0.9}

\usepackage{sectsty}
\sectionfont{\sffamily\bfseries\color{RoyalBlue}\sectionrule{0pt}{0pt}{-1ex}{1pt}}
\subsectionfont{\sffamily\bfseries\color{RoyalBlue}}
\subsubsectionfont{\sffamily\bfseries\color{RoyalBlue}}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\lfoot{}\cfoot{}\rfoot{}
\renewcommand{\headrule}{}

\usepackage{graphicx}
%\usepackage{xstring}


<<include=FALSE>>=
library(DOSE)
library(clusterProfiler)
library(knitr)
opts_chunk$set(tidy=TRUE,tidy.opts=list(keep.blank.line=FALSE, width.cutoff=50),out.truncate=80,out.lines=6,cache=TRUE,dev='pdf',include=TRUE,fig.width=6,fig.height=6,resolution=150)
@

\newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}}


\title{\textsf{\textbf{Disease Ontology Semantic and Enrichment analysis}}}
\author{Guangchuang Yu, Li-Gen Wang}

\begin{document}
\maketitle

<<options,results='hide',echo=FALSE>>=
options(digits=3, width=80, prompt=" ", continue=" ")
@

\begin{abstract}
Disease Ontology (DO) aims to provide an open source ontology for the integration of biomedical data that is associated with human disease. We developed \Rpackage{DOSE} package to promote the investigation of diseases. \Rpackage{DOSE} provides five methods including Resnik, Lin, Jiang, Rel and Wang for measuring semantic similarities among DO terms and gene products; Hypergeometric model and gene set enrichment analysis were also implemented for extracting disease association insight from genome wide expression profiles.
\end{abstract}

\tableofcontents


\section{Introduction}
Public health is an important driving force behind biological and medical research. 
A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications.  
Recent research has increasingly demonstrated that many seemingly dissimilar diseases have common molecular mechanisms. 
Understanding similarities among disease aids in early diagnosis and new drug development.
\\
\\
Formal knowledge representation of gene-disease association is demanded for this purpose. 
Ontologies, such as Gene Ontology, have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information.
Disease Ontology (DO) \cite{schriml_disease_2011} was developed to create a consistent description of gene products with disease perspectives, and is essential for supporting functional genomics in disease context.
Accurate disease descriptions can discover new relationships between genes and disease, and new functions for previous uncharacteried genes and alleles.
\\
\\
Unlike other clinical vocabularies that defined disease related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge. The application of disease ontology is in its infancy, lacking programs for mining DO knowledge automatically.
\\
\\
Here, we present an \textsf{R} package \Rpackage{DOSE} for analyzing semantic similarities among DO terms and gene products annotated with DO terms, and extracting disease association insight from genome wide expression profiles.
\\
\\
Four information content (IC)-based methods and one graph structure-based method were implemented for measuring semantic similarity. Hypergeometric test and Gene Set Enrichment Analysis were implemented for extracting biological insight.
\\
\\
To start with \Rpackage{DOSE} package, type following code below:

<<loading, eval=FALSE, results='hide'>>=
library(DOSE)
help(DOSE)
@

\section{DO term semantic similarity measurement}
Four methods determine the semantic similarity of two terms based on the Information Content of 
their common ancestor term were proposed by Resnik \cite{Resnik1999}, Jiang \cite{Jiang1997}, 
Lin \cite{Lin1998} and Schlicker \cite{Schlicker2006}. Wang \cite{Wang2007} presented a method to measure the similarity based on the graph structure. Each of these methods has its own advantage and weakness. \Rpackage{DOSE} implemented all these methods to compute semantic similarity among DO terms and gene products. We have developed another package \Rpackage{GOSemSim} \cite{GYu2010} to explore the functional similarity at GO perspective, including molecular function (MF), biological process (BP) and cellular component (CC).

\subsection{Information content-based method}
Information content (IC) is defined as the negative logarithm of the frequency of each term occurs in the corpus of DO annotation. 
\\
\\
The frequency of a term t is defined as:
$$ 
p(t) = \frac{n_{t'}}{N} 
	| t' \in \left\{t, \; children\: of\: t \right\}
$$
where $n_{t'}$ is the number of term t', and N is the total number of terms in DO corpus.

Thus the information content is defined as:
$$ IC(t) = -\log(p(t)) $$

IC-based methods calculate similarity of two DO terms based on the information content of their closest common ancestor term, which was also called most informative information ancestor (MICA).

\subsubsection{Resnik method}
The Resnik method is defined as:
$$
sim_{Resnik}(t_1,t_2) = IC(MICA)
$$

\subsubsection{Lin method}
The Lin method is defined as:
$$
sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}
$$

\subsubsection{Rel method}
The Relevance method, which was proposed by Schlicker, combine Resnik's and Lin's method 
and is defined as:
$$
sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}
$$

\subsubsection{Jiang method}
The Jiang and Conrath's method is defined as:
$$
sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))
$$

\subsection{Graph-based method}
Graph-based methods using the topology of DO graph structure to compute semantic similarity. 
Formally, a DO term A can be represented as $DAG_{A} = (A, T_{A}, E_{A})$ where $T_{A}$ is the set of DO terms in $DAG_{A}$, including term A and all of its ancestor terms in the DO graph, and $E_{A}$ is the set of edges connecting the DO terms in $DAG_{A}$.

\subsubsection{Wang method}
To encode the semantic of a DO term in a measurable format to enable a quantitative comparison, Wang firstly defined the semantic value of term A as the aggregate contribution of all terms in $DAG_{A}$ to the semantics of term A, terms closer to term A in $DAG_{A}$ contribute more to its semantics. Thus, defined the contribution of a DO term \textit{t} to the semantic of DO term A as the S-value of DO term \textit{t} related to term A. For any of term \textit{t} in $DAG_{A}$, its S-value related to term A, $S_{A}(\textit{t})$ is defined as:
$$
      \left\{
        \begin{array}{l}
          S_{A}(A)=1 \\
          S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A
        \end{array}
      \right.
$$
where $w_{e}$ is the semantic contribution factor for edge $e \in
    E_{A}$ linking term \textit{t} with its child term
    \textit{t}'. Term A contributes to its own is defined as
    one. After obtaining the S-values for all terms in $DAG_{A}$, the
    semantic value of DO term A, SV(A), is calculated as:
$$
SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)
$$
Thus given two DO terms A and B, the semantic similarity between these
    two terms is defined as:
$$
sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{S_{A}(t) + S_{B}(t)}}
						{SV(A) + SV(B)}
$$
    where $S_{A}(\textit{t})$ is the S-value of DO term \textit{t}
    related to term A and $S_{B}(\textit{t})$ is the S-value of DO
    term \textit{t} related to term B.

\subsection{doSim function}
In \Rpackage{DOSE}, we implemented all these IC-based and graph-based methods. \Rfunction{doSim} can calculate semantic similarity between two DO terms and two set of DO terms.

<<doSim>>=
data(DO2EG)
set.seed(123)
a <- sample(names(DO2EG), 10)
a
b <- sample(names(DO2EG), 5)
b
doSim(a[1], b[1], measure="Wang")
doSim(a[1], b[1], measure="Resnik")
doSim(a[1], b[1], measure="Lin")
s <- doSim(a, b, measure="Wang")
s
@

\Rfunction{doSim} requires three parameter \Rparameter{DOID1}, \Rparameter{DOID2} and \Rparameter{measure}. \Rparameter{DOID1} and \Rparameter{DOID2} should be a vector of DO terms, while \Rparameter{measure} should be one of Resnik, Jiang, Lin, Rel, and Wang.
\\
\\
We also implement a plot function \Rfunction{simplot} to visualize the similarity result.
<<simplot, fig.cap="Visualizing similarity matrix.", fig.align="center", fig.height=6, fig.width=6, out.width="0.6\\textwidth", fig.pos="h">>=
simplot(s, 
		color.low="white", color.high="red", 
		labs=TRUE, digits=2, labs.size=5, 
		font.size=14, xlab="", ylab="")
@

Parameter \Rparameter{color.low} and \Rparameter{colow.high} are used to setting the color gradient; \Rparameter{labs} is a logical parameter indicating whether to show the similarity values or not, \Rparameter{digits} to indicate the number of decimal places to be used and \Rparameter{labs.size} setting the size of similarity values; \Rparameter{font.size} setting the font size of axis and label of the coordinate system.

\section{Gene semantic similarity measurement}
On the basis of semantic similarity between DO terms, \Rpackage{DOSE} can also compute semantic similarity among gene products.
\\
\\
Suppose we have gene $g_1$ annotated by DO term set $DO_{1}=\{do_{11},do_{12} \cdots do_{1m}\}$ and $g_2$ annotated by $DO_{2}=\{do_{21},do_{22} \cdots do_{2n}\}$, \Rpackage{DOSE} implemented four methods which called \Rfunction{max}, \Rfunction{avg}, \Rfunction{rcmax} and \Rfunction{BMA} to combine semantic similairty scores of multiple DO terms. 

\subsection{Combine method}

\subsubsection{max}
The \Rfunction{max} method calculates the maximum semantic similarity score over all pairs of DO terms between these two DO term sets.
$$
sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(do_{1i}, do_{2j})
$$

\subsubsection{avg}
The \Rfunction{avg} calculates the average semantic similarity score over all pairs of DO terms.
$$
sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(do_{1i}, do_{2j})}
						{m \times n}
$$

\subsubsection{rcmax}
Similarities among two sets of DO terms form a matrix, the \Rfunction{rcmax} method uses the maximum of RowScore and ColumnScore as the similarity, where RowScore (or ColumnScore) is the average of maximum similarity on each row (or column).
$$
sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1
    \le j \le n} sim(do_{1i}, do_{2j})}{m},
\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(do_{1i},
  do_{2j})}{n})
$$

\subsubsection{BMA}
The \Rfunction{BMA} method, used the best-match average strategy, calculates the average of all maximum similarities on each row and column, and is defined as:
$$
sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{1=i}^m \max_{1 \le j
    \le n}sim(do_{1i}, do_{2j}) + \displaystyle\sum_{1=j}^n \max_{1
    \le i \le m}sim(do_{1i}, do_{2j})} {m+n}
$$

\subsection{geneSim function}
In \Rpackage{DOSE}, we implemented \Rfunction{geneSim} to measure semantic similarities among genes.
<<geneSim>>=
data(EG2DO)
g1 <- sample(names(EG2DO), 5)
g1
g2 <- sample(names(EG2DO), 4)
g2
geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
gs <- geneSim(g1, g2, measure="Wang", combine="BMA")
gs
@

\Rfunction{geneSim} requires four parameter \Rparameter{geneID1}, \Rparameter{geneID2}, \Rparameter{measure} and \Rparameter{combine}. \Rparameter{geneID1} and \Rparameter{geneID2} should be a vector of entrez gene IDs, \Rparameter{measure} should be one of Resnik, Jiang, Lin, Rel, and Wang, while \Rparameter{combine} should be one of max, avg, rcmax and BMA as described previously.

The \Rfunction{simplot} works well with both the output of \Rfunction{doSim} and \Rfunction{geneSim}.


\section{DO term enrichment analysis}

\subsection{Hypergeometric model}
Enrichment analysis \cite{boyle2004} is a widely used approach to identify biological
themes. Here we implement hypergeometric model to assess whether the
number of selected genes associated with disease is larger than
expected. 

To determine whether any DO terms annotate a specified list of
    genes at frequency greater than that would be expected by chance,
    \Rpackage{DOSE} calculates a p-value using the hypergeometric distribution:

$
p = 1 - \displaystyle\sum_{i = 0}^{k-1}
  \frac{
      {M \choose i}
      {{N-M} \choose {n-i}}
    } {
      {N \choose n}
    }
$

In this equation, \textit{N} is the total number of genes in the
background distribution, \textit{M} is the number of genes within that
distribution that are annotated (either directly or indirectly) to the
node of interest, \textit{n} is the size of the list of genes of
interest and \textit{k} is the number of genes within that list which
are annotated to the node. The background distribution by default is
all the genes that have DO annotation.

P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control.

\subsection{enrichDO function}
\Rpackage{DOSE} provides an example dataset \textit{geneList} which was derived from \textsf{R} package \Rpackage{breastCancerMAINZ} that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versue geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in \textit{geneList} dataset.
\\
\\
In the following example, we selected fold change above 1 as the differential genes and analyzing their disease association.
<<enrichment analysis>>=
data(geneList)
gene <- names(geneList)[abs(geneList) > 1]
head(gene)
x <- enrichDO(gene, ont="DOLite", 
		pvalueCutoff=0.05, pAdjustMethod="BH",
		universe = names(geneList), 
		minGSSize = 5, readable=FALSE)
head(summary(x))
@

The \Rfunction{enrichDO} requires an entrezgene ID vector as input, mostly is the differential gene list of gene expression profile studies. The \Rparameter{ont} parameter can be "DO" or "DOLite", DOLite \cite{Du15062009} was constructed to aggregate the redundant DO terms; \Rparameter{pvalueCutoff} setting the cutoff value of p value and p value adjust; pAdjustMethod setting the p value correction methods, include the Bonferroni correction ("bonferroni"), Holm ("holm"), Hochberg ("hochberg"), Hommel ("hommel"), Benjamini \& Hochberg ("BH") and Benjamini \& Yekutieli ("BY").

The \Rparameter{universe} setting the background gene universe for testing. If user do not explicitly setting this parameter, \Rfunction{enrichDO} will set the universe to all human genes that have DO annotation.

The \Rparameter{minGSSize} indicates that only those DO terms that have more than \Rparameter{minGSSize} genes annotated will be tested.

The \Rparameter{readable} is a logical parameter, indicates whether the entrezgene IDs will mapping to gene symbols or not.      

We also implement \Rfunction{setReadable} function that helps the user to convert entrezgene IDs to gene symbols.
<<setReadable>>=
x <- setReadable(x)
head(summary(x))
@ 

\subsection{Visualze enrichment result}
We also implement a bar plot and category-gene-network for visualization. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.
<<barplot, fig.cap="barplot of DO enrichment result.", fig.align="center", fig.height=5, fig.width=6, out.width="0.6\\textwidth", fig.pos="h">>=
barplot(x)
@

In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories, we developed \Rfunction{cnetplot} function to extract the complex association between genes and diseases.
<<cnetplot, fig.cap="cnetplot of DO enrichment result.", fig.align="center", fig.height=16, fig.width=16, fig.pos="h">>=
cnetplot(x, categorySize="pvalue", foldChange=geneList)
@

\subsection{Disease association comparison}
We have developed an \textsf{R} package \Rpackage{clusterProfiler} \cite{yu_clusterprofiler:_2012} for comparing biological themes among gene clusters. \Rpackage{DOSE} works fine with \Rpackage{clusterProfiler} and can compare biological themes at disease perspective.
 
<<clusterProfiler, fig.cap="DOSE with clusterProfiler.", fig.align="center", fig.height=6, fig.width=6, out.width="0.6\\textwidth", fig.pos="h">>= 
require(clusterProfiler)
data(gcSample)
cdo <- compareCluster(gcSample, fun="enrichDO")
plot(cdo)
@

\section{Gene set enrichment analysis}

\subsection{GSEA algorithm}
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The DO term enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) \cite{subramanian_gene_2005} directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
\\
\\
Genes are ranked based on their phenotypes. Given a priori defined set of gens \textit{S} (e.g., genes shareing the same \textit{DO} or \textit{DOLite} category), the goal of GSEA is to determine whether the members of \textit{S} are randomly distributed throughout the ranked gene list (\textit{L}) or primarily found at the top or bottom.
\\
\\
There are three key elements of the GSEA method:
\begin{itemize}
	\item Calculation of an Enrichment Score.\\
The enrichment score (\textit{ES}) represent the degree to which a set \textit{S} is over-represented at the top or bottom of the ranked list \textit{L}. The score is calculated by walking down the list \textit{L}, increasing a running-sum statistic when we encounter a gene in \textit{S} and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The \textit{ES} is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic \cite{subramanian_gene_2005}.
	\item Esimation of Significance Level of \textit{ES}.\\
The \textit{p-value} of the \textit{ES} is calculated using permutation test. Specifically, we permute the gene labels of the gene list \textit{L} and recompute the \textit{ES} of the gene set for the permutated data, which generate a null distribution for the \textit{ES}. The \textit{p-value} of the observed ES is then calculated relative to this null distribution.
	\item Adjustment for Multiple Hypothesis Testing.\\
When the entire \textit{DO} or \textit{DOLite} gene sets is evaluated, \Rpackage{DOSE} adjust the estimated significance level to account for multiple hypothesis testing and also \textit{q-values} were calculated for FDR control.
\end{itemize}

\subsection{gseAnalyzer fuction}
In \Rpackage{DOSE}, we implemented GSEA algorithm proposed by Subramanian \cite{subramanian_gene_2005} in \Rfunction{gseAnalyzer} function.

In the following example, in order to speedup the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed. 
<<GSEA analysis>>=
y <- gseAnalyzer(geneList, setType="DOLite", nPerm=100, 
                  minGSSize=120, pvalueCutoff=0.05, 
                  pAdjustMethod="BH", verbose=FALSE)
res <- summary(y)
head(res)
@
The \Rparameter{setType} should be one of "DO" or "DOLite and was required for \Rfunction{gseaAnalyzer} to prepare the corresponding gene sets.

<<gseaplot, fig.cap="gseaplot example.", fig.align="center", fig.height=6, fig.width=6, out.width="0.6\\textwidth", fig.pos="h">>=  
topID <- res[1,1]
topID
plot(y, geneSetID = topID)
@

Parameter \Rparameter{geneSetID} can be numeric, the following command will generate the same figure as illustrated above.
<<gseaplot2, eval=FALSE>>=
plot(y, geneSetID = 1)
@

\subsubsection{enrichMap}
Enrichment Map can be visualized by \Rfunction{enrichMap} function. It supports both enrichment result and GSEA result.


\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<sessInfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\bibliographystyle{unsrt}
\bibliography{DOSE}
\end{document}