\documentclass[12pt]{article}
\title{BIOC 2005 Lab: From CEL Files to Annotated Lists of Interesting Genes}

\author{James W. MacDonald and Rafael A. Irizarry}


\usepackage{amsmath,pstricks, fullpage} 
\usepackage[authoryear,round]{natbib} 
\usepackage{theorem, float}
\usepackage{hyperref,graphicx}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunarg}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\textit{#1}}
\theoremstyle{break} 
\newtheorem{Ex}{Exercise}


\textwidth=6.2in 
\textheight=8.5in 
\oddsidemargin=.1in 
\evensidemargin=.1in 
\headheight=-.3in

\usepackage{c:/rw2011/share/texmf/Sweave}
\begin{document}
\maketitle

\section{Introduction}

%% Do we need this?
%<<setup, echo=FALSE, results=hide>>=
%library("RbcBook1")
%@
<<echo=FALSE, results=hide>>=
options(width=70)
@ 

The predominant use for microarrays is the measurement of genome-wide expression
levels, and the most commonly used microarray platform is the Affymetrix GeneChip.
Affymetrix GeneChip arrays use short oligonucleotides to probe for genes in an
RNA sample. Genes are represented by a set of oligonucleotide probes each with
a length of 25 bases. Because of their short length, multiple probes are used
to improve specificity. Affymetrix arrays typically use between 11 and 20
probe pairs, referred to as a probeset, for each gene. One
component of these pairs is referred to as a perfect match probe
(PM) and is designed to hybridize only with transcripts from the
intended gene (specific hybridization). However, hybridization to the PM
probes by other mRNA species (non-specific hybridization) is
unavoidable. Therefore, the observed intensities need to be adjusted to be
accurately quantified. The other component of a probe pair, the
mismatch probe (MM), is constructed with the intention
of measuring only the non-specific component of the corresponding PM
probe. Affymetrix's strategy is to make MM probes identical to their PM
counterpart except that the 13-th base is exchanged with its complement.

The identification of genes that are differentially expressed \index{differential expression} in two
populations is a popular application of Affymetrix GeneChip
technology. Due to  
the cost of this technology, experiments using a small number of arrays
are common. A situation we often see is the case where three arrays are
used for each population. In this lab, we give an example of how
to quickly create lists of genes that are interesting in the sense that they
appear to be differentially expressed, 
starting from the raw probe level data (CEL files). \index{CEL file} 
%%See Labs~\ref{anal:diffexpr} and \ref{case:limma} for more on the analysis of differential expression.
In Section~\ref{import}, we briefly
describe the functions necessary to import the data into
Bioconductor. In Section~\ref{rma} we talk about preprocessing. In
Section~\ref{tests}, we describe ways to rank genes and decide
on a cutoff. Finally, in Section~\ref{annotation} we describe how to
make annotated reports and examine the PubMed literature related to
the genes in our list.

\section{Reading CEL files}
\label{import}
The \Rpackage{affy} package is needed to import the data into Bioconductor.
%%FIXME: so why hide it?
%%i was hidnig because of the long msg but lets see how it looks...
<<results=hide>>=
library("affy")
@ 
In the Affymetrix system, the raw image data are stored in so-called
DAT files. Currently, Bioconductor software does not
handle the image data, and the starting point for most analyses are the
CEL files.  These are the result of processing the
DAT files using Affymetrix software to produce estimated probe intensity
values. Each CEL file contains data on the intensity at each probe on the
GeneChip, as well as some other quantities.

To import CEL file data into the Bioconductor system,
we use the \Rfunction{ReadAffy} function. A
typical invocation is:
<<readaffy,eval=FALSE>>=
library("affy")
Data <- ReadAffy()
@
which will attempt to read all the CEL files in the current working directory
and place the probe-level data in an \Robject{AffyBatch}. If only a portion of 
the CEL files are to be read, one can use the \Rfunarg{filenames} argument to
select those files by name, or set the \Rfunarg{widget} to \Robject{TRUE} to 
use a graphical user interface to select files.
In this lab, we will not read in the
data. Instead we use an already created \Robject{AffyBatch} object, named 
\Robject{spikein95}, available through the package \Rpackage{SpikeInSubset}.

<<results=hide>>=
library("SpikeInSubset")
data(spikein95)
@

\index{phenoData}
An \Robject{AffyBatch} contains several \Robject{slots} that hold various
data:
<<>>=
slotNames(spikein95)
@ 
The \Robject{cdfName} slot contains the name of the Chip Description File,
which maps the probes to their physical location on the chip, the \Robject{nrow}
and \Robject{ncol} slots which give the dimensions of the chip, the \Robject{exprs}
slot which contains the probe level data (the \Robject{se.exprs} is not used at this
time), the \Robject{phenoData} slot which contains phenotypic data that describe
each sample, the \Robject{annotation} slot which contains the name of the annotation
package for the chip, the \Robject{description} slot which contains the 'MIAME'
information for the experiment (see ?MIAME for more information), and the \Robject{notes}
slot, which contains any notes about the experiment.

The \Robject{spikein95} \Robject{Affybatch} contains data from a six 
array subset (2 sets of triplicates) from a
calibration experiment performed by Affymetrix. For the purpose of this
experiment, we replace the current \Robject{phenoData}, describing the 
calibration experiment, with information needed for our
example dealing with two populations.

<<>>=
pd <- data.frame(population=c(1,1,1,2,2,2),
replicate=c(1,2,3,1,2,3))
rownames(pd) <- sampleNames(spikein95)
vl <- list(population="1 is control, 2 is treatment",
           replicate="arbitrary numbering") 
phenoData(spikein95) <- new("phenoData",pData=pd,varLabels=vl)
@ 
%
The assignment to \Robject{phenoData} above, can also be done using the
function \Rfunction{read.phenoData}.

There are various \emph{accessor} functions available for extracting
data from an \Robject{AffyBatch}. Two notable examples are the \Rfunction{pm}
and \Rfunction{mm} functions which are used to extract the PM and MM probe intensities.
One may also pass a vector of Affymetrix probe IDs as the second argument to these
functions to return intensities for only certain probesets. More information
about the \Robject{AffyBatch} class can be found in the help page (accessed by 
typing class ?Affybatch at an R prompt).

\subsection{Quality Control}

Obtaining gene expression measures for biological samples from
Affymetrix GeneChip microarrays is an elaborate process with many
potential sources of variation. In addition, the process is both
costly and time consuming. Therefore, it is critical to make the best
use of the information produced by the arrays, and to ascertain the
quality of this information. Unfortunately, data quality assessment is
complicated by the sheer volume of data involved and by the many
processing steps required to produce the expression measures for each
array. Furthermore, quality is a term which has some dependency on
context to which it is applied. From the viewpoint of a user of
GeneChip expression values, lower variability data, with all
other things being equal, should be judged to be of higher quality.

There are several Bioconductor packages designed for the analysis of
Affymetrix GeneChip data, each of which has a set of functions that can
be used to assess overall quality of the raw data. The \Rpackage{affy}
package supplies several functions for exploratory data analysis (EDA), 
which is often the most powerful way to detect either outlier or
high-variability chips.

<<results=hide>>=
library("ALLMLL")
data(MLL.B)
Data <- MLL.B[,c(2,1,3:5,14,6,13)]
sampleNames(Data) <- letters[1:8]
rm(MLL.B)
gc()
@ 

\begin{Ex}
The \Rpackage{ALLMLL} contains a subset of data from a large acute lymphoblastic leukemia
(ALL) study \citep{ross:etal:2004}. The above code loaded the package and subsetted 
the resulting \Robject{AffyBatch} to contain just eight samples.
\begin{itemize}
\item Using the \Rfunction{image} function, plot a false color image of the first chip. Notice
the obvious spatial artifact.
\item Using the \Rfunction{hist} function, plot smoothed densities for each chip. Note the 
bimodal distribution of the first chip - this is due to the spatial artifact noted above.
\item Look at a boxplot of the chips - the high background of the 'f' chip is obvious in this plot.
\end{itemize}
\end{Ex}
%% Maybe have them look at MA plots too?

\section{Preprocessing}
\label{rma}
\index{preprocessing}

After checking overall chip quality, the next step is to preprocess the data 
to obtain expression measures for each gene on each array. Preprocessing Affymetrix 
expression arrays usually involves three
steps: background adjustment, normalization, and summarization.

There are several methods available from Bioconductor for preprocessing Affymetrix 
GeneChip data. These methods include the \Rfunction{mas5} function, which very closely
duplicates the MAS5 method developed by Affymetrix \citep{Affymetrix:ExpressionAnalysisTechnicalManual:2003},
 the \Rfunction{fit.li.wong} function, which emulates the MBEI method \citep{LiWongPNAS2001}, 
the \Rfunction{rma} function \citep{Irizarry2003}, and \Rfunction{gcrma} \citep{Wu:etal:2004}. In
addition, the \Rfunction{expresso} function can be used to combine various background adjustment, 
normalization, and summarization methods.

Here we simply use 
\Rfunction{rma} to compute expression measures.
%%FIXME: don't hide things if you describe them
%%Fixed.. but i worry about the messages looking ugly.... put it back 
%%to where it was, becuase nothing is learned from the msg.
<<results=hide>>=
eset <- rma(spikein95)
@ 

The \Rfunction{rma} function will background correct, normalize, and
summarize the probe level data, as described by \cite{Irizarry2003}, into expression
level data. The expression values are in log base 2 scale. 
The information is saved in an instance of the \Robject{exprSet}
class, which contains similar information to the \Robject{AffyBatch}
class. Specifically, the  \Robject{exprSet} contains expression level as opposed to probe level data.

%%FIXME: except....; what is different
%%Fixed


A matrix with the expression information is readily available. The following code extracts the 
expression data and demonstrates the dimensions of the matrix storing the data: 
<<>>=
e <- exprs(eset)
dim(e)
@ 
We can see that in this matrix, rows represent genes and columns
represent arrays. To know which columns go with what population, we can
rely on the \Robject{phenoData} information inherited from
\Robject{spikein95}.  

<<>>=
pData(eset)
@ 
We can conveniently use \$ to access each column and create indexes 
denoting which columns represent each population:
<<>>=
Index1 <- which(eset$population==1)
Index2 <- which(eset$population==2)
@ 
We will use this information in the next section.

As mentioned above, we are only demonstrating the use of RMA.  Other options are
available through the functions \Rfunction{mas5} and
\Rfunction{expresso}. If you find \Rfunction{expresso} too slow, the
package \Rpackage{affyPLM} provides an alternative,
\Robject{threestep}, that is faster owing to use of C code.

\begin{Ex} \begin{itemize}\item Compute expression values for these data using the mas5 
function. Compare the first replicate from the two populations using an 
'M vs A' plot (see ?mva.plot) Hint: you will need to subset out the two chips
you want to compare. How does this compare to an M vs A plot for the same two 
samples using the rma expression measure?

\item The rma function fits a robust model to the data. As with all models, it is 
useful to look at the residuals to see how well the model fits the data. Using
the affyPLM package, compute rma expression measures and then plot residuals
for each chip. Hint: see ?rmaPLM and the ``Fitting Probe Level Models''
vignette in the affyPLM package.\end{itemize}\end{Ex}

\section{Ranking and filtering genes}
\label{tests}
\index{ranking}
Now we have, in \Robject{e}, a measurement $x_{ijk}$ of log (base 2)
expression from each gene $j$ on each array $i$ for both populations
$k=1,2$. For ranking purposes, it is convenient to quantify the average
level of differential expression for each gene. A naive first choice
is to simply consider the average log fold-change:
\[
d_j= \bar{x}_{j2}-\bar{x}_{j1}
\]
with
% FIXME2005 watch for leaks on right
\[
\ \bar{x}_{j2}=(x_{1j2}+x_{2j2}+x_{3j2})/3 \mbox{ and }
\bar{x}_{j1}=(x_{1j1}+x_{2j1}+x_{3j1})/3, \mbox{ for }j=1,\dots,J.
\]
To obtain $d$, we can use the \Rfunction{rowMeans} function, which provides a
much faster alternative to the commonly used function \Rfunction{apply}:
<<rowM>>=
d <- rowMeans(e[,Index2])-rowMeans(e[,Index1])
@ 

Various authors have noticed that the variability of fold-change
measurements usually depends on over-all expression of the gene in
question. This means that the evidence that large observed values of $d$
provide, should be judged by conditioning on some value 
representing over-all expression. An example is average log expression:  
<<>>=
a <- rowMeans(e)
@ 

The MA-plot~\ref{volcano}A shows $d$ plotted against $a$. 
We restrict the $y$-axis to log fold changes of less than
1 in absolute value (fold change smaller than 2). We do this because in this
analysis only one gene reached a fold change larger than 2. To see this we can
use the following line of code:
%
<<sumAbs>>=
sum(abs(d)>1)
@ 
%
<<maplot,fig=TRUE,include=FALSE,width=5,height=5,echo=FALSE,results=hide>>=
plot(a,d,ylim=c(-1,1),main="A) MA-plot",pch=".")

@ 


%------------------------------------------------------------
\subsection{Summary statistics and tests for ranking}
%------------------------------------------------------------
In Figure~\ref{volcano}A we can see how variance decreases with the value
of \Robject{a}. Also, different genes may have different levels of variation. 
Should we then consider a ranking procedure that takes
variance into account? A popular choice is the $t$-statistic. 
\index{$t$-statistic} 
The $t$-statistic
is a ratio comparing the effect size estimate $d$ to a sample based 
within group estimate of the standard error:

\[
s_j^2= \frac{1}{2}\sum_{i=1}^3 (x_{ij2}-\bar{x}_{j2})^2/3 + 
\frac{1}{2}\sum_{i=1}^3 (x_{ij1}-\bar{x}_{j1})^2/3, 
\mbox{ for }j=1,\dots,J.
\]

To calculate the $t$-statistic, $d_j/s_j$, we can
use the function \Rfunction{rowttests} from
the package \Rpackage{genefilter}, as demonstrated in the following code:
<<rowttests, results=hide>>=
library("genefilter")
tt <- rowttests(e, factor(eset$population))
@ 
%
The first argument is the data matrix, the second indicates the two
groups being compared.

Do our rankings change much if we use the $t$-statistic instead of average
log fold change? The volcano plot \index{volcano plot} is a useful way to see both these
quantities simultaneously. This figure plots $p$-values (more
specifically $-\log_{10}$ the $p$-value) versus effect
size. For simplicity we assume the $t$-statistic follows a
t-distribution.
To create a volcano plot, seen in Figure~\ref{volcano}B, we can use the following
simple code:
<<volcano,fig=TRUE,include=FALSE,width=5,height=5,echo=FALSE,results=hide>>=
lod <- -log10(tt$p.value)
plot(d,lod,cex=.25,main="B) Volcano plot for $t$-test")
abline(h=2)
@ 

This Figure demonstrates that the $t$-test and the average log
fold change give us different answers. 

\begin{figure}[H]
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=0.49\textwidth]{affylab-maplot}&
\includegraphics[width=0.49\textwidth]{affylab-volcano}\\
\includegraphics[width=0.49\textwidth]{affylab-volcano2}&
\includegraphics[width=0.49\textwidth]{affylab-volcano3}
\end{tabular}
\end{center}
\caption{A) MA-plot. B) Volcano plot for the $t$-test. C) As B) but with restricted $x$-axis, blue diamonds denoting the top 25 genes ranked by average fold change and red circles to denoting the top 25 genes ranked by smallest $p$-value. D) As C) but for the moderated $t$-test.}
\label{volcano}
\end{figure}

Figure~\ref{volcano}C is like Figure~\ref{volcano}B but we restrict the
$x$-axis to $[-1,1]$ to get a better view of most of the data.
We also add blue diamonds to denote the top 25 genes
ranked by average fold change and red circles to denote the top 25
genes ranked by smallest $p$-value.
The following lines of code permit us to create this figure:
<<volcano2,fig=TRUE,include=FALSE,width=5,height=5,echo=FALSE,results=hide>>=
lod <- -log10(tt$p.value)
plot(d,lod,cex=.25,main="B) Volcano plot for $t$-test")
abline(h=2)
@ 



%\begin{figure}[H]
%\begin{center}
%\includegraphics[width=0.49\textwidth]{fromcels-volcano2}
%\end{center}
%\caption{\label{volcano2}Volcano plot with restricted $x$-axis. Blue
%  diamonds denote the top 25 genes ranked by average fold change and
%  red circles to denote the top 25 genes ranked by smallest $p$-value.}
%\end{figure}

There is a relatively large disparity. Possible explanations are 1)
Some genes have larger variance than others. Large variance genes that
are not differentially expressed have a higher chance of having large log
fold changes. Because the $t$-statistics take variance into account,
these do not have small $p$-values. 2) With only three measurements per group the
estimate of the standard error of the effect size is not stable and
some genes have small $p$-values only because, by chance, the denominator of
the $t$-statistic was very small. Figure \ref{volcano}C demonstrates
both of these explanations are possible.


Do we use average fold change or the $t$-statistic to rank? They both
appear to have strengths and weaknesses. As mentioned, a problem with the
$t$-statistic is that there is not enough data to estimate
variances. Many researchers have proposed alternative statistics that
borrow strength across all genes to obtain a more stable estimate of
gene-specific variance.  The resulting statistics are referred to
as {\em modified} $t$-statistics. Because they reduce the possibility  
of large values they are also referred to as 
{\em penalized}, {\em attenuated}, or {\em regularized} $t$-statistics.
The are many versions of modified $t$-statistics, for example the statistic used by 
SAM \index{SAM} \citep{tusher:2001}, and it is impossible to describe   
them all in this lab.  

An example of such a modified $t$-statistic is given by \citet{smyth:2004} and
is available in the \Rpackage{limma} package. It is denoted the
\index{moderated $t$-statistic} {\em moderated} $t$-statistic. This statistic
is based on an empirical Bayes approach, described in detail by the above
mentioned references, can be implemented with the following code:
<<>>=
library("limma")
design <- model.matrix(~factor(eset$population))
fit <- lmFit(eset, design)
ebayes <- eBayes(fit)
@ 
For more details on the above code please see the \Rpackage{limma}
user's manual. 
One detail we will point out is that, as part of its output, 
the \Rfunction{lmFit} function produces the
values we have stored in \Robject{d} and \Robject{tstat} previously computed.

Figure \ref{volcano}D demonstrates the empirical Bayes approach
improves on the $t$-statistic in terms of not giving high rank to  
genes only because they have small sample variances. To see this
notice that we now have fewer genes with small $p$-values (large
in the $y$-axis) that have very small log fold changes (values close to 0 in
the $x$-axis).  %%FIXME:how? %%Fixed.  
<<volcano3,fig=TRUE,include=FALSE,width=5,height=5,echo=FALSE,results=hide>>=
lod <- -log10(ebayes$p.value[,2])
mtstat<- ebayes$t[,2]
o1 <- order(abs(d),decreasing=TRUE)[1:25]
o2 <- order(abs(mtstat),decreasing=TRUE)[1:25]
o <- union(o1,o2)
plot(d[-o],lod[-o],cex=.25,xlim=c(-1,1),ylim=c(0,4),
main="D) Volcano plot for moderated $t$-test")
points(d[o1],lod[o1],pch=18,col="blue")
points(d[o2],lod[o2],pch=1,col="red")
@ 
  


\subsection{Selecting cutoffs}
%%FIXME - refer to multtest lab
%%FIXED
We have presented three statistics that can be used to rank genes. Now
we turn our attention to deciding on a cutoff. A naive approach is to
consider genes attaining $p$-values less than 0.01. However, because we are
testing multiple hypotheses, the $p$-values no longer have the typical
meaning. Notice that if all the 12626 (number of genes on array) null
hypotheses are true (no genes are differentially expressed) we expect
$0.01 \times 12626 = 126.26$ false positives. In our data set we have
<<>>=
sum(tt$p.value<=0.01)
@ 

Various approaches have been suggested for dealing with the
multiple hypothesis problem. \index{multiple testing}
The suggested procedures usually provide adjusted $p$-values that 
can be used to decide on appropriate cutoffs. 
However, different procedures result in different interpretations of 
the resulting lists. 
In Section~\ref{annotation}, we demonstrate simple code that 
can be used, along with the results of the
\Rfunction{eBayes} function described above,
to construct lists of genes based on adjusted $p$-values.

An alternative \textit{ad hoc} procedure is to use the MA and volcano plots 
to look for groups of genes that appear to be departing from the
majority. For example, in Figure \ref{volcano}C we can restrict our
attention to the small cluster of genes in the upper right corner
having small $p$-value  and large fold changes. 

\subsection{Comparison}

We deviate slightly from our analysis to demonstrate that the
moderated $t$-statistic appears to perform better than average log fold
change and the $t$-test.

Because we are using a spike-in experiment we can assess the three
competing statistics. First we obtain the names of the genes that were
spiked-in. This is available in the original \Robject{phenoData}.
<<>>=
data(spikein95)
spikedin <- colnames(pData(spikein95))
spikedIndex <- match(spikedin,geneNames(eset))
@ 

Because in this experiment replicate RNA, except for the spiked-in
material, was hybridized to all arrays, only the 16 spiked-in genes
should be differentially expressed. This means that a perfect
ranking procedure would assign ranks from 1 to 16 to the 16 spiked-in genes.
The following code shows that the moderated $t$-statistic is
performing better than its three competitors.
<<>>=
d.rank <- sort(rank(-abs(d))[spikedIndex])
t.rank <- sort(rank(-abs(tt$statistic))[spikedIndex])
mt.rank <- sort(rank(-abs(mtstat))[spikedIndex])
ranks <- cbind(mt.rank,d.rank,t.rank)
rownames(ranks) <- NULL
ranks
@ 
The results above permit us to see how many true positives we would
have in a list of any size ranked by each of the three statistics. 
To do this, we simply count the
number of spiked-in genes with ranks smaller or equal to the size of the list.
As an example notice that in a list of the top 25 genes, 
we would obtain 12, 11, and 7 true
positives for the moderated $t$-statistic. 
This demonstrates that the choice of ranking statistic can have an effect on your final results. Also notice that 
this is just one analysis, and one would 
expect different results with a new data set. 
We encourage users to run this analysis on other data sets and assess the 
performance of different alternative procedures.
%%FIXME: and these are estimates with standard errors, and I am not
%% convinced on one run that you have the order right :-(
%%Fixed.


\section{Annotation}
\label{annotation}
\index{annotation}
Let us now consider a list of interesting genes and create a
report. As an example we will consider the top 10 genes as ranked by the 
moderated $t$-statistic. To do this we can use the \Rfunction{topTable} 
function from the   
\Rpackage{limma} package in the following way:  
<<>>=
tab <- topTable(ebayes,coef=2,adjust="fdr",n=10)
@ 
This function creates a table showing various interesting
statistics. This line of code shows us the first 5 entries:
<<>>=
tab[1:5,]
@ 
By specifying \Rfunarg{coef} as $2$ we define $d$ as our parameter of interest.
The argument \Rfunarg{n} defines how many top genes to
include in the table. Finally, through \Rfunarg{adjust} we are 
instructing \Rfunction{topTable} to calculate the false discovery
rate adjustment for the $p$-values obtained from \Rfunction{eBayes} using the 
procedure described by \citet{benjamini:1995}. See the limma user's manual
for more details.

In the remainder of this lab, we will use the Affymetrix identifiers to  
obtain biological meta-data information about these 10 genes. 
We can obtain these using the \Robject{ID} column of \Robject{tab}:  
<<>>= 
genenames <- as.character(tab$ID)
@
\subsection{Annotation package overview}

First, we will digress slightly to explain the various metaData packages that
are available from Bioconductor. The metaData packages can be divided into three groups:

\begin{itemize}
\item General annotation (GO, KEGG)
\item Species level annotation (xxxhomology, YEAST)
\item Chip level annotation (hgu133plus2, rat4302, etc.)
\end{itemize}

Each of these metaData packages contain one or more \emph{environments} which
are similar to hash tables. A hash table provides a very quick way of looking
up \emph{values} using a \emph{key}. We can use the functions \Rfunction{get}, 
and \Rfunction{mget} to extract data from a given environment. As an example, we
will use the \Rpackage{hgu95av2} package. Note that this package contains several
environments, most of which use the Affymetrix probe IDs as keys. 

<<>>=
library("hgu95av2")
hgu95av2()
gn <- mget(genenames, hgu95av2GENENAME)
sapply(gn, function(x) if(!is.na(x)) strwrap(x) else x)
@ 

One problem that may occur when using an environment for the first time is determining
what the keys are for that particular environment. The keys can be extracted using the 
\Rfunction{ls} function:

<<>>=
ls(hgu95av2GENENAME)[1:10]
@ 

The metaData packages are generally used in a step-wise manner, starting at the most
specific (chip-level), and proceeding to the most general. For instance, we could take
a set of Affymetrix probe IDs and map them to their respective Gene Ontology (GO)
terms using the \Robject{hgu95av2GO} environment. This will give us a list containing
all the GO IDs associated with each of the Affymetrix probe IDs. We could then use these
GO IDs (which are keys for the \Rpackage{GO} package) to map the Affymetrix probe IDs 
to the GO terms that are associated with each of the Affymetrix probe IDs.

<<>>=
library("GO")
library("annotate")
gos <- mget(genenames, hgu95av2GO)
gos[[1]][1:2]
goids <- sapply(gos, names)
goids[[1]]
goterms <- sapply(goids, function(x) mget(as.character(x), GOTERM))
sapply(goterms[[1]], Term)
@ 

\begin{Ex}
We have shown how to extract GO terms, starting with a set of Affymetrix
identifiers. The \Robject{hgu95av2PATH} environment maps Affymetrix IDs to the
Kyoto Encyclopedia of Genes and Geneomes (KEGG), which maintains pathway information
for various organisms. As can be seen above, only about 25\% of the Affymetrix IDs
actually map to a pathway -- the remaining key values simply return an \Robject{NA}.

Extract the first five KEGG pathway IDs from \Robject{hgu95av2PATH} and using the 
\Rpackage{KEGG} package, map them to the respective pathway name. See the help
page for \Robject{hgu95av2PATH} for a hint.
\end{Ex}

This section showed how to extract data from environments 'by hand'. Although
this sort of thing can be tedious, it is sometimes necessary, so it is worth knowing
how it is done. However, the \Rpackage{annotate} and \Rpackage{annaffy} packages
supply powerful tools for automated extraction of annotation information that can be 
used in most situations. In the following sections, we present some simple examples.



\subsection{PubMed abstracts}
%%FIXME: appropriate references to other parts of the monograph where they
%% can learn more
%%Fixed
The \Rpackage{annotate} package contains
many functions that can be used to query the PubMed
\index{PubMed} database and format the results.  Here we show how some of these can be used to 
obtain abstracts with references to a list of genes and report findings in the form of a Web page. 
Specifically, to get the abstract information we can use the
\Robject{pm.getabst} 
in the \Rpackage{annotate} package (which requires the \Robject{XML} package):
<<loadlibxml,results=hide>>=
library("XML")
absts<-pm.getabst(genenames,"hgu95av2")
@ 
The return value \Robject{absts} is a list of the same length as the
\Robject{genenames} argument. Each of the list's components 
corresponds to a gene and is itself a list. Each component of this list is an abstract for that gene. 
To see the fourth abstract for the first gene, we type the following:
<<absts12>>=
absts[[1]][[4]]
@ 
Please be careful using the function \Robject{pm.getabst} because it queries
PubMed directly. Too many queries of PubMed can get you banned.

To only see the titles for, say, the second probeset, we can type the
following. 
<<pmtit2>>=
## wh 13.01.05: commented out pm.titles since it behaves weird. 
## Revert back to it when it is fixed?
##titl <- pm.titles(absts[2])
titl <- sapply(absts[[2]], articleTitle)
strwrap(titl, simplify=FALSE)
@ 
Here we used the \Rfunction{strwrap} function to format the text
to fit the page width.

Having these abstracts as R objects is useful because, for example,
we can search for words in the abstract automatically. The code
below searches for the word ``protein'' in all the abstracts and
returns a logical value:
<<>>=
pro.res <- sapply(absts, function(x) pm.abstGrep("[Pp]rotein", x))
pro.res[[2]]
@ 

%%FIXME: annaffy, etc references
%%Fixed: see above
A convenient way to view the abstracts is to create a Web page:
<<>>=
pmAbst2HTML(absts[[2]],filename="pm.html")
@ 
The above code creates an HTML file with abstracts for the second
probeset in \Robject{genenames}.

\subsection{Generating reports}
The object \Robject{top} gives us valuable information about the top 10 genes. 
However, it is convenient to create a more informative table. We can
easily add annotation to this table and create an HTML document with
hyperlinks to sites with more information about the genes. The
following code obtains the Entrez Gene ID, UniGene ID, and symbol for each of our
selected genes.
<<>>=
ll <- getLL(genenames,"hgu95av2")
ug <- unlist(lookUp(genenames, "hgu95av2", "UNIGENE"))
sym <- getSYMBOL(genenames,"hgu95av2")
@ 
With these values available, we can use the following code to create an HTML 
page useful for sharing results with collaborators. 
%%FIXME: please do not use deprecated functions
%%Fixed
<<output>>=
tab <- data.frame(sym,tab[,-1])
htmlpage(list(ll, ug),filename="report.html",title="HTML report",othernames=tab,
         table.head=c("Locus ID", "UniGene ID", colnames(tab)),table.center=TRUE, 
          repository=list("ll","ug"))
@ 

This HTML report contains links to both Entrez Gene and UniGene. However, \Rfunction{htmlpage}
can only make links to a limited number of online databases, and is primarily intended
for microarrays that do not have chip-level annotation packages. For those microarrays that
do have annotation packages, we can use the
\Rpackage{annaffy} package. Below are a few lines of code that create
a useful HTML report \index{HTML report} with links to various annotation sites.
<<results=hide>>=
library("annaffy")
atab <- aafTableAnn( genenames, "hgu95av2", aaf.handler() )
saveHTML(atab, file="report2.html")
@ 

\begin{Ex}
The \Rpackage{annaffy} package can be used to create HTML (and text) tables that contain
not only links to various online databases, but that contain additional information as well.
It useful to append the expression values and statistics for the genes that are 
being annotated, which gives you and your collaborators the ability to perform 'sanity checks'
to ensure that no coding errors have been made in selecting a set of genes.

Reproduce the above HTML table, but include the t-statistic, p-value, and expression values
for each gene. Also limit the number of links to only include the Affymetrix ID, UniGene,
Entrez Gene, and PubMed.
\end{Ex}

\subsection{Statistics based on Annotation}

Given a set of differentially regulated genes, it is fairly common to ask if certain
pathways are being perturbed in some way. This is a difficult question to answer given the
fact that most genes (75\% of the genes on the HG-U95Av2 chip, for instance) are not mapped to 
any pathway. As a surrogate, we can look for GO terms that appear to be 'concentrated' in 
a set of differentially expressed genes using the \Rfunction{GOHyperG} function in the
\Rpackage{GOstats} package. If we find that the number of genes with a certain GO term
are overrepresented in the set of differentially expressed genes, then this may imply that 
a particular pathway or function is being affected.

<<results=hide>>=
library("GOstats")
@ 
<<>>=
index <- sample(1:12000, 100)
gn <- ls(hgu95av2GENENAME)[index]
lls <- unique(getLL(gn, "hgu95av2"))
hyp <- GOHyperG(lls, "hgu95av2", what = "MF")
index2 <- hyp$pvalues < 0.05
wh <- mget(names(hyp$pvalues)[index2], GOTERM)
terms <- sapply(wh, Term)
strwrap(terms[1:10], simplify = FALSE)
hyp$pvalues[index2][1:10]
@ 

\begin{Ex}
The above code is an example of something that could easily become quite tedious
if it had to be repeated more than one or two times. It is often worth the time and
effort to write a small 'wrapper' function that will perform all the required steps
to take e.g., a vector of Affymetrix probe IDs and output a \Robject{data.frame} containing
the GO ID, GO term, p-value, the number of genes with that GO term in the set of significant
genes, and the number on the chip. The last two bits of information are good to know
because it is easy to get a very small p-value when there are few genes with a given GO term 
on the chip (e.g., if there are three genes with a GO term XXX, and one is in the list of
significant genes, you will get a small p-value, but it is probably not an interesting 
result).

If we have time (and you feel up to it), write a wrapper function to do this, outputting the
results as a \Robject{data.frame}.
\end{Ex}


\section{Conclusion}
%%inconsistent reference to cel files :-(
%%Fixed
We have demonstrated how one can use Bioconductor tools to go from the
raw data in CEL files to annotated reports of a select group of
interesting genes. We presented only one way of doing this. There are
many alternatives also available via Bioconductor. We hope that this
lab serves as an example of a useful analysis and also of the
flexibility of Bioconductor.

\bibliographystyle{plainnat}
\bibliography{affylab.bib}

\end{document}