\documentclass[a4paper]{article}
\title{The \emph{mgsa} package}
\author{Sebastian Bauer, Julien Gagneur}
\date{28 April 2011}

% The is for R CMD check, which finds it in spite of the "%", and also for
% automatic creation of links in the HTML documentation for the package:
% \VignetteIndexEntry{Overview of the mgsa package.}


\begin{document}

%%%%%%%% Setup

% Don't reform code
\SweaveOpts{keep.source=TRUE}

% Size for figures
\setkeys{Gin}{width=\textwidth}

% Reduce characters per line in R output

<<set_width, echo=FALSE>>=
options( width = 80 )
@ 

% Make title
\maketitle


%%%%%%%% Main text

\section{Introduction}
Model-based Gene Set Analysis (MGSA, Bauer et al.~\cite{Bauer2010}) is a Bayesian modeling approach for gene set enrichment.
The package \emph{mgsa} implements MGSA and tools to use MGSA together with the Gene Ontology~\cite{GO2000}. 

MGSA takes as input \emph{observations} (such as differentially expressed genes) and a list of gene \emph{sets} (for example pathways or annotated terms of the gene ontologies).
The model assumes that some sets to be inferred are \emph{active} and that all genes member of active sets are themselves active.
Active genes are more likely to belong to the observations and inactive genes not.
Fitting the model amounts to infer the probability of every set to be active given the observations.


This procedure provides a useful alternative to classical gene set enrichment analysis~\cite{Bauer2010}.
Classical methods analyze each set in isolation. Because sets such as biological pathways often share genes with each other, the returned list of enriched sets is usually long and redundant. 
In contrast, MGSA takes set overlap into account by working on all sets simultaneously and substantially reduces the number of redundant sets.

The model have three parameters:
\begin{itemize}
\item $\alpha$, the false positive rate i.e. the probability of an inactive gene to actually be observed;
\item $\beta$, the false negative rate i.e. the probability of an active gene to actually be not observed;
\item $p$, the prior probability for any set to be active, a typically small number ensuring sparse solutions to be inferred.
\end{itemize}
MGSA is Bayesian about these parameters too. Relatively uninformative priors are specified for them and the algorithm performs inference on the parameters as well.

Technical details about the algorithm and benchmarks of the method are given in~\cite{Bauer2010}.
 
\section{Quick start}
We start with a small simulated dataset which contains \texttt{example\_go}, a random subset of yeast gene ontology annotations with 20 terms and
\texttt{example\_o}, a simulated set of observed genes.
These genes could for example be the "hits" of some screen or a set of differentially expressed genes.
In the simulation, the terms GO:0006109 and GO:0030663 were active, implying that genes annotated to these terms were more likely to be observed positives than other genes. 
<<loadex>>=
library(mgsa)
data("example")
example_go
example_o
@

The method \texttt{mgsa} fits the MGSA model.
It returns a \texttt{MgsaMcmcResults} object whose \texttt{print} method displays the most likely active terms. 
On this example, \texttt{mgsa} correctly reports largest posterior probabilities for the terms GO:0006109 and GO:0030663.
The call to \texttt{set.seed()}, which sets the seed of the random number generator, simply ensures the example of this vignette to be reproducible.
It is not required for \texttt{mgsa()} to work. 
<<fitex>>=
set.seed(0)
fit = mgsa(example_o, example_go)
fit
@

The method \texttt{plot} provides a graphical visualization of the fit.

<<plotex, fig=TRUE>>=
plot(fit)
@

Finally, inference results, i.e., the set activities, can be extracted with the accessor function \texttt{setsResults()}.
The data frame returned by the function can be sorted or filtered with R's standard data frame operations.
The following code extract the set results into a data frame named \texttt{res} and filter for those with posterior probability estimate greater than 0.5.  
<<setsResults>>=
res = setsResults(fit)
subset(res, estimate > 0.5)
@

  
\section{Using the Gene Ontology}
The Gene Ontology~\cite{GO2000} (GO) provides structured annotations to genes.
Genes with the same annotation belong to the same gene set. MGSA can be run on these gene sets.
GO annotation files for the studied organism can be downloaded from the GO web page:
http://www.geneontology.org.

The function \texttt{readGAF} creates an \texttt{MgsaGoSets} object, a particular \texttt{MgsaSets}, from such a gene annotation file.
Note that \texttt{readGAF} requires the package \texttt{GO.db} and \texttt{RSQLite} to be installed.

For illustration purposes, a simplified GO annotation file with only three yeast genes is provided:

<<readGAF>>=
readGAF(
	system.file(
		"example_files/gene_association_head.sgd",
		package="mgsa"
	)
)
@     

\section{Using custom gene sets}
MGSA is not restricted to Gene Ontology and can be applied to any gene sets.
The method \texttt{mgsa} can directly be called on such gene sets provided as \texttt{list} as in the example below.
<<mgsa_custom_set>>=
mgsa( c("A", "B"), list(set1=LETTERS[1:3], set2=LETTERS[2:5]) ) 
@

Internally, the method \texttt{mgsa} indexes all elements of the sets before fitting the model.
In case \texttt{mgsa} must be run on several observations with the same gene sets, computations can be speeded up by performing this indexing once for all.
This can be achieved by building a \texttt{MgsaSets}.

<<mgsa_custom_set>>=
myset = new( "MgsaSets", sets=list(set1=LETTERS[1:3], set2=LETTERS[2:5]) )
mgsa(c("A", "B"), myset)
mgsa(c("B", "C"), myset)
@


\begin{thebibliography}{10}

\bibitem{Bauer2010}
S. Bauer, J. Gagneur, and P. N. Robinson.
\newblock GOing Bayesian: model-based gene set analysis of genome-scale data.
\newblock \textit{Nucleic acids research}, 2010.


\bibitem{GO2000}
The Gene Ontology Consortium.
\newblock Gene Ontology: tool for the unification of biology.
\newblock \textit{Nature Genetics}, 25:25--29,2000.

\end{thebibliography}

\end{document}