%\VignetteIndexEntry{Hypergeometric Tests Using GOstats}
%\VignetteDepends{ALL, GO.db, annotate, hgu95av2.db, genefilter, RColorBrewer,
%xtable, Rgraphviz} 
%\VignetteKeywords{Hypergeometric, GO, ontology}
%\VignettePackage{GOstats}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newlength{\smallfigwidth}
\setlength{\smallfigwidth}{6cm}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\title{How To Use GOstats \\
Testing Gene Lists for GO Term Association}
\author{S. Falcon and R. Gentleman}

\begin{document}

\maketitle

\section{Introduction}

The\Rpackage{GOstats} package has extensive facilities for testing the
association of Gene Ontology (GO)~\cite{GO} terms to genes in a gene
list.  You can test for both over and under representation of GO terms
using either the standard Hypergeometric test or a conditional
Hypergeometric test that uses the relationships among the GO terms for
conditioning (similar to that presented in~\cite{Alexa06}).

In this vignette we describe the preprocessing required to construct
inputs for the main testing function, \Rfunction{hyperGTest}, the
algorithms used, and the structure of the return value.  We use a
microarray data set \citep{Chiaretti2004} from a clinical trial in
acute lymphoblastic leukemia (ALL) to work an example analysis.  In
the ALL data, we focus on the patients with B-cell derived ALL, and in
particular on comparing the group with ALL1/AF4 to those
with no observed cytogenetic abnormalities.

To get started, load the packages needed for this analysis:

<<Setup, echo=TRUE, results=hide>>=
library("ALL")
library("hgu95av2.db")
library("GO.db")
library("annotate")
library("genefilter")
library("GOstats")
library("RColorBrewer")
library("xtable")
library("Rgraphviz")
@

\section{Preprocessing and Inputs}

To perform an analysis using the Hypergeometric-based tests, one needs
to define a \textit{gene universe} (usually conceptualized as the
number of balls in an urn) and a list of selected genes from the
universe.  While it is clear that the selected gene list determines to
a large degree the results of the analysis, the fact that the universe
has a large effect on the conclusions is, perhaps, less obvious.

For microarray data, one can use the unique gene identifiers assayed
in the experiment as the gene universe.  However, the presence of a
gene on the array does not necessarily mean much.  Some arrays, such
as those from Affymetrix, attempt to include probes for as much of the
genome as possible.  Since not all genes will be expressed under all
conditions (a widely held belief is that about 40\% of the genome
is expressed in any tissue), it may be sensible to reduce the universe
to those that are expressed.

To identify the set of expressed genes from a microarray experiment,
we propose that a non-specific filter be applied and that the genes
that pass the filter be used to form the universe for any subsequent
functional analyses.  Below, we outline the non-specific filtering
procedure used for the example analysis.

Once a gene universe has been established, one can apply any number of
methods to select genes.  For the example analysis we use a simple
$t$-test to identify differentially expressed genes among the two
subgroups in the sample population.

<<universeSizeVsPvalue, echo=FALSE, results=hide>>=
hg_tester <- function(size) {
    numFound <- 10
    numDrawn <- 400
    numAtCat <- 40
    numNotAtCat <- size - numAtCat
    phyper(numFound-1, numAtCat, numNotAtCat, numDrawn, lower.tail=FALSE)
}
pv1000 <- hg_tester(1000)
pv5000 <- hg_tester(5000)
@ 

It is worth noting that the effect of increasing the universe size
with genes that are irrelevant to the questions at hand, in general,
has the effect of making the resultant $p$-values look more
significant.  For example, in a universe of 1000 genes where 400 have
been selected, suppose that a GO term has 40 gene annotations from the
universe of 1000.  If 10 of the genes in the selected gene list are
among the 40 genes annotated at this category, then the Hypergeometric
$p$-value is \Sexpr{round(pv1000,2)}.  However, if the gene universe
contained 5000 genes, the $p$-value would drop to
\Sexpr{round(pv5000,3)}.


\subsection{Non-specific filtering}

First we load the \Robject{ALL} data object and extract the subset of
the data we wish to analyze: subjects with either no cytogenetic
abnormality (``NEG'') or those with ''ALL1/AF4''.

<<bcrAblOrNegSubset, results=hide, echo=TRUE, cache=TRUE>>=
data(ALL, package="ALL")
## For this data we can have ALL1/AF4 or BCR/ABL
subsetType <- "ALL1/AF4"

## Subset of interest: 37BRC/ABL + 42NEG = 79 samples
Bcell <- grep("^B", as.character(ALL$BT))
bcrAblOrNegIdx <- which(as.character(ALL$mol) %in% c("NEG", subsetType))

bcrAblOrNeg <- ALL[, intersect(Bcell, bcrAblOrNegIdx)]
bcrAblOrNeg$mol.biol = factor(bcrAblOrNeg$mol.biol)
@ 

We begin our non-specific filtering by removing probe sets that have
no Entrez Gene identifier in our annotation data.

<<nsFiltering-noEntrez, results=hide, echo=TRUE, cache=TRUE>>=
## Remove genes that have no entrezGene id
entrezIds <- mget(featureNames(bcrAblOrNeg), envir=hgu95av2ENTREZID)
haveEntrezId <- names(entrezIds)[sapply(entrezIds, function(x) !is.na(x))]
numNoEntrezId <- length(featureNames(bcrAblOrNeg)) - length(haveEntrezId)
bcrAblOrNeg <- bcrAblOrNeg[haveEntrezId, ]
@ 

Similarly, we remove probe sets for which we have no GO annotation.

<<nsFiltering-noGO, results=hide, echo=TRUE, cache=TRUE>>=
## Remove genes with no GO mapping
haveGo <- sapply(mget(featureNames(bcrAblOrNeg), hgu95av2GO),
                 function(x) {
                     if (length(x) == 1 && is.na(x)) 
                       FALSE 
                     else TRUE
                 })
numNoGO <- sum(!haveGo)
bcrAblOrNeg <- bcrAblOrNeg[haveGo, ]
@

Now use the IQR of each probe set across samples to remove those probe
sets that have little variation across samples.  Also, since there is
an imbalance of men and women by group, we remove probe sets that
measure genes on the Y chromosome.

<<nsFiltering-IQR, results=hide, echo=TRUE, cache=TRUE>>=
## Non-specific filtering based on IQR
iqrCutoff <- 0.5
bcrAblOrNegIqr <- apply(exprs(bcrAblOrNeg), 1, IQR)
selected <- bcrAblOrNegIqr > iqrCutoff

## Drop those that are on the Y chromosome
## because there is an imbalance of men and women by group
chrN <- mget(featureNames(bcrAblOrNeg), envir=hgu95av2CHR)
onY <- sapply(chrN, function(x) any(x=="Y"))
onY[is.na(onY)] <- FALSE
selected <- selected & !onY

nsFiltered <- bcrAblOrNeg[selected, ]
@ 

Here we ensure that each probe set maps to exactly one Entrez Gene ID.
If multiple probes are found to map to the same Entrez Gene ID, we
select the probe with largest IQR (from the computation above).

<<nsFiltering-unique, results=hide, echo=TRUE, cache=TRUE>>=
numNsWithDups <- length(featureNames(nsFiltered))
nsFilteredIqr <- bcrAblOrNegIqr[selected]
uniqGenes <- findLargest(featureNames(nsFiltered), nsFilteredIqr, 
                         "hgu95av2")
nsFiltered <- nsFiltered[uniqGenes, ]
numSelected <- length(featureNames(nsFiltered))

##set up some colors
BCRcols = ifelse(nsFiltered$mol == subsetType, "goldenrod", "skyblue")
cols = brewer.pal(10, "RdBu")
@

Finally, we can define the gene universe we will use for the
Hypergeometric tests.

<<defineGeneUniverse, echo=TRUE, results=hide, cache=TRUE>>=
## Define gene universe based on results of non-specific filtering
affyUniverse <- featureNames(nsFiltered)
entrezUniverse <- unlist(mget(affyUniverse, hgu95av2ENTREZID))
if (any(duplicated(entrezUniverse)))
  stop("error in gene universe: can't have duplicate Entrez Gene Ids")

## Also define an alternate universe based on the entire chip
chipAffyUniverse <- featureNames(bcrAblOrNeg)
chipEntrezUniverse <- mget(chipAffyUniverse, hgu95av2ENTREZID)
chipEntrezUniverse <- unique(unlist(chipEntrezUniverse)) 
@ 

\textbf{Summary of non-specific filtering:} Our non-specific filtering
procedure removed probes missing either Entrez Gene identifies or
mappings to GO terms.  Because of an imbalance of men and women by
group, probes measuring genes on the Y chromosome were dropped.  The
inter-quartile range was used with a cutoff of \Sexpr{iqrCutoff} to
select probes with sufficient variability across samples to be
informative; probes with little variability across all samples are
inherently uninteresting.  Finally, the set of remaining probes was
refined by ensuring that each probe maps to exactly one Entrez Gene
identifier.  For those probes mapping to the same Entrez Gene ID, the
probe with largest IQR was selected.

Producing a set of Entrez Gene identifiers that map to a unique set of
probes at the non-specific filtering stage is important because genes
are mapped to GO categories using Entrez Gene IDs and we want to avoid
double counting any GO categories.  In all, the filtering left
\Sexpr{length(featureNames(nsFiltered))} genes.

\subsection{Gene selection via t-test}

We apply a standard $t$-test to identify a set of genes with
differential expression between the \Sexpr{subsetType} and NEG groups.

<<parametric1, echo=TRUE, results=hide, cache=TRUE>>=
ttestCutoff <- 0.05
ttests = rowttests(nsFiltered, "mol.biol")

smPV = ttests$p.value < ttestCutoff

pvalFiltered <- nsFiltered[smPV, ]
selectedEntrezIds <- unlist(mget(featureNames(pvalFiltered),
                                 hgu95av2ENTREZID))
@

There are \Sexpr{sum(smPV)} genes with $p$-values less than
\Sexpr{ttestCutoff}.  We do not make use of any $p$-value correction
methods since we are interested in a relatively long gene list.

A detail often omitted from GO association analyses is the fact that
the $t$-test, and most similar statistics, are directional.  For a
given gene, average expression might be higher in the
\Sexpr{subsetType} group than in the NEG group, whereas for a
different gene it might be the NEG group that shows the increased
expression.  By only looking at the $p$-values for the test
statistics, the directionality is lost.  The danger is that an
association with a GO category may be found where the genes are not
differentially expressed in the same direction.  One way to tackle
this problem is by separating the selected gene list into two lists
according to direction and running two analyses.  A more elegant
approach is the subject of further research.


\subsection{Inputs}

Often one wishes to perform many similar analyses using slightly
different sets of parameters and to facilitate this pattern of usage
the main interface to the Hypergeometric tests,
\Rfunction{hyperGTest}, takes a single parameter object as its
argument. This argument is an instance of class
\Rclass{GOHyperGParams}.  There are also parameter classes
\Rclass{KEGGHyperGParams} and \Rclass{PFAMHyperGParams} defined in the
\Rpackage{Category} package that allow for testing for association
with KEGG pathways and PFAM protein domains, respectively.

Using a parameter class instead of individual arguments makes it
easier to organize and execute a series of related analyses.  For
example, one can create a list of \Rclass{GOHyperGParams} instances
and perform the Hypergeometric test on each using R's
\Rfunction{lapply} function:

\begin{verbatim}
resultList <- lapply(lisOfParamObjs, hyperGTest)
\end{verbatim}

In the absence of a parameter class, this could be achieved using
\Rfunction{mapply}, but the result would be less readable.  Because
parameter objects can be copied and modified, they tend to reduce
duplication of code.  We'll demonstrate this in the following example.

Below, we create a parameter instance by specifying the gene list, the
universe, the name of the annotation data package, and the GO ontology
we wish to interrogate.  For the example analysis, we have stored the
vector of Entrez Gene identifiers making up the gene universe in
\Robject{entrezUniverse}.  The selected genes are stored in
\Robject{selectedEntrezIds}.  If you are following along with your own
data and have an \Rclass{ExpressionSet} instance resulting from a
non-specific filtering procedure, you can create the
\Robject{entrezUniverse} and \Robject{selectedEntrezIds} vectors using
code similar to that shown here:

<<withYourData1, eval=FALSE>>=
entrezUniverse <- unlist(mget(featureNames(yourData), 
                              hgu95av2ENTREZID))
if (any(duplicated(entrezUniverse)))
  stop("error in gene universe: can't have duplicate Entrez Gene Ids")

pvalFiltered <- yourData[hasSmallPvalue, ]
selectedEntrezIds <- unlist(mget(featureNames(pvalFiltered),
                                 hgu95av2ENTREZID))
@ 

Here is a description of all the arguments needed to construct a
\Rclass{GOHyperGParams} instance.

\begin{description}
\item[\Robject{geneIds}] A vector of gene identifiers that defines the
  selected list of genes.  This is often the output of a test for
  differential expression among two sample groups.  For experiments
  using Affymetrix expression arrays, this should be a vector of
  Entrez Gene IDs.  If you are using the \Rpackage{YEAST} annotation
  package, the vector will consist of yeast systematic names.

\item[\Robject{universeGeneIds}] A vector of gene identifiers that
  defines the universe of possible genes.  We recommend using the set
  of gene IDs that result from non-specific filtering.  The
  identifiers should be of the same type as the \Robject{geneIds}; for
  Affymetrix arrays, these will be Entrez Gene IDs.

\item[\Robject{annotation}] A string giving the name of the annotation
  data package that corresponds to the chip used in the experiment.

\item[\Robject{ontology}] A two-letter string specifying one of the
  three GO ontologies: BP, CC, or MF.  The \Rfunction{hyperGTest}
  function only tests a single GO ontology at one time.

\item[\Robject{pvalueCutoff}] A numeric values between zero and one
  used as a $p$-value cutoff for $p$-values generated by the
  Hypergeometric test.  When the test being performed is
  non-conditional, this is only used as a default value for printing
  and summarizing the results.  For a conditional analysis, the cutoff
  is used during the computation to determine perform the
  conditioning: child terms with a $p$-value less than
  \Robject{pvalueCutoff} are conditioned out of the test for their
  parent term.

\item[\Robject{conditional}] A logical value.  If \Robject{TRUE}, the
  test performed uses the conditional algorithm.  Otherwise, a
  standard Hypergeometric test is performed.  When 'conditional(p) ==
  TRUE', the 'hyperGTest' function uses the structure of the GO graph to
  estimate for each term whether or not there is evidence beyond that
  which is provided by the term's children to call the term in question
  statistically overrepresented. The algorithm conditions on all child
  terms that are themselves significant at the specified p-value cutoff.
  Given a subgraph of one of the three GO ontologies, the terms with no
  child categories are tested first.  Next the nodes whose children have
  already been tested are tested.  If any of a given node's children
  tested significant, the appropriate conditioning is performed.

\item[\Robject{testDirection}] A string which can be either ``over''
  or ``under''.  This determines whether the test performed detects
  over or under represented GO terms.

\end{description}


<<standardHyperGeo, echo=TRUE, results=hide, cache=TRUE>>=
hgCutoff <- 0.001
params <- new("GOHyperGParams",
              geneIds=selectedEntrezIds,
              universeGeneIds=entrezUniverse,
              annotation="hgu95av2.db",
              ontology="BP",
              pvalueCutoff=hgCutoff,
              conditional=FALSE,
              testDirection="over")

@ 

We would also like to perform a conditional test.  Instead of having
to define a new \Rclass{GOHyperGParams} instance by hand, we can
create a copy and update just the parameter of interest.

<<condHyperGeo, echo=TRUE, results=hide, cache=TRUE>>=
paramsCond <- params
conditional(paramsCond) <- TRUE
@ 

A similar approach would work to create a parameter object for testing
a different GO ontology or to create an object for testing under
representation.


\section{Outputs and Result Summarization}

The \Rfunction{hyperGTest} function returns an instance of class
\Rclass{GOHyperGResult}.  When the input parameter object is a
\Rclass{KEGGHyperGParams} or \Rclass{PFAMHyperGParams} instance, the
result will instead be a \Rclass{HyperGResult} object.  Most of the
reporting and summarization methods demonstrated here will work the
same, except for those that deal specifically with GO or the GO graph.

As shown below, printing the result at the R prompt provides a brief
summary of the test performed and the number of significant terms
found.  Depending on how you pre-processed your gene list and gene
universe, The hyperGTest function may have to do even more filtering
on both of these for you.  Genes that are not marked with a GO term
from the ontology that you specified will have to be discarded,
and so you might notice that your gene list and gene universe had
shrank somewhat when you print the results.

<<standardHGTEST, cache=TRUE, echo=TRUE, results=hide>>=
hgOver <- hyperGTest(params)
hgCondOver <- hyperGTest(paramsCond)

@ 

%% use separate chunk for printing so computation can
%% be cached
<<HGTestAns>>=
hgOver
hgCondOver
@ 

The \Rfunction{summary} function returns a \Rclass{data.frame}
summarizing the result.  By default, only the results for terms with a
$p$-value less than the cutoff specified in the parameter instance
will be shown.  However, you can set a new cutoff using the
\Robject{pvalue} argument.  You can also set a minimum number of genes
for each term using the \Robject{categorySize} argument.  For
\Rclass{GOHyperGResult} objects, the \Rfunction{summary} method also
has a \Robject{htmlLinks} argument.  When \Robject{TRUE}, the GO term
names are printed as HTML links to the GO website.
%% FIXME: the default should be changed to be htmlLinks=FALSE
%% and htmlReport should turn it on.
<<summaryEx>>=
df <- summary(hgOver)
names(df)                               # the columns
dim(summary(hgOver, pvalue=0.1))
dim(summary(hgOver, categorySize=10))
@ 

Now we demonstrate some of the accessor functions that can be used to
extract detail from a result object.  These functions are all detailed
in their respective manual pages.

<<resultAccessors>>=
pvalues(hgOver)[1:3]

oddsRatios(hgOver)[1:3]

expectedCounts(hgOver)[1:3]

geneCounts(hgOver)[1:3]
universeCounts(hgOver)[1:3]

length(geneIds(hgOver))
length(geneIdUniverse(hgOver)[[3]])

## GOHyperGResult _only_
## (NB: edges go from parent to child)
goDag(hgOver)

geneMappedCount(hgOver)
universeMappedCount(hgOver)

conditional(hgOver)
testDirection(hgOver)
testName(hgOver)

@ 

To make it easy for non-technical users to review the results, the
\Rfunction{htmlReport} function generates an HTML file that can be
viewed in any web browser.  The output generated by
\Rfunction{htmlReport} as called below is output to your current
working directory.

<<htmlReportExample, results=hide, echo=TRUE>>=
htmlReport(hgCondOver, file="ALL_hgco.html")
@ 


<<helperFunc, echo=FALSE, results=hide, cache=TRUE>>=
sigCategories <- function(res, p) {
    if (missing(p))
      p <- pvalueCutoff(res)
    pv <- pvalues(res)
    goIds <- names(pv[pv < p])
    goIds
}
@ 

<<plotFuns, echo=FALSE, results=hide, cache=TRUE>>=
coloredGoPlot <- function(ccMaxGraph, hgOver, hgCondOver) {
    nodeColors <- sapply(nodes(ccMaxGraph),
                         function(n) {
                             if (n %in% sigCategories(hgCondOver))
                               "dark red"
                             else if (n %in% sigCategories(hgOver))
                               "pink"
                             else 
                               "gray"
                         })
    nattr <- makeNodeAttrs(ccMaxGraph,
                           label=nodes(ccMaxGraph),
                           shape="ellipse",
                           fillcolor=nodeColors,
                           fixedsize=FALSE)
    plot(ccMaxGraph, nodeAttrs=nattr)
}

getMaxConnCompGraph <- function(hgOver, hgCondOver) {
    ##uGoDagRev <- ugraph(goDag(hgOver))
    sigNodes <- sigCategories(hgOver)
    ##adjNodes <- unlist(adj(uGoDagRev, sigNodes))
    adjNodes <- unlist(adj(goDag(hgOver), sigNodes))
    displayNodes <- unique(c(sigNodes, adjNodes))
    displayGraph <- subGraph(displayNodes, goDag(hgOver))
    cc <- connComp(displayGraph)
    ccSizes <- listLen(cc)
    ccMaxIdx <- which(ccSizes == max(ccSizes))
    ccMaxGraph <- subGraph(cc[[ccMaxIdx]], displayGraph)
    ccMaxGraph 
}

@ 

\section{GOstats Capabilities}

In the Hypergeometric model, each term is treated as an independent
classification.  Each gene is cross-classified according to whether or
not it has been selected and whether or not it is annotated, not
necessarily specifically annotated, at a particular term.  A
Hypergeometric probability is computed to assess whether the number of
selected genes associated with the term is larger than expected.

The \Rfunction{hyperGTest} function provides an implementation of the
commonly applied Hypergeometric calculation for over or
under-representation of GO terms in a specified gene list.  This
computation ignores the structure of the GO terms, treating each term
as independent from all other terms.

%%FIXME: need to cite Alexa here again...
Often an analysis for GO term associations results in the
identification of directly related GO terms with considerable overlap
of genes.  This is because each GO term inherits all annotations from
its more specific descendants.  To alleviate this problem, we have
implemented a method which conditions on all child terms that are
themselves significant at a specified $p$-value cutoff.  Given a
subgraph of one of the three GO ontologies, we test the leaves of the
graph, that is, those terms with no child terms.  Before testing the
terms whose children have already been tested, we remove all genes
annotated at significant children from the parent's gene list.  This
continues until all terms have been tested.


<<info, results=tex>>=
toLatex(sessionInfo())
@ 

\bibliography{GOstats}


\end{document}