\documentclass[12pt]{article} \title{BIOC 2005 Lab: From CEL Files to Annotated Lists of Interesting Genes} \author{James W. MacDonald and Rafael A. Irizarry} \usepackage{amsmath,pstricks, fullpage} \usepackage[authoryear,round]{natbib} \usepackage{theorem, float} \usepackage{hyperref,graphicx} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunarg}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\textit{#1}} \theoremstyle{break} \newtheorem{Ex}{Exercise} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \usepackage{c:/rw2011/share/texmf/Sweave} \begin{document} \maketitle \section{Introduction} %% Do we need this? %<>= %library("RbcBook1") %@ <>= options(width=70) @ The predominant use for microarrays is the measurement of genome-wide expression levels, and the most commonly used microarray platform is the Affymetrix GeneChip. Affymetrix GeneChip arrays use short oligonucleotides to probe for genes in an RNA sample. Genes are represented by a set of oligonucleotide probes each with a length of 25 bases. Because of their short length, multiple probes are used to improve specificity. Affymetrix arrays typically use between 11 and 20 probe pairs, referred to as a probeset, for each gene. One component of these pairs is referred to as a perfect match probe (PM) and is designed to hybridize only with transcripts from the intended gene (specific hybridization). However, hybridization to the PM probes by other mRNA species (non-specific hybridization) is unavoidable. Therefore, the observed intensities need to be adjusted to be accurately quantified. The other component of a probe pair, the mismatch probe (MM), is constructed with the intention of measuring only the non-specific component of the corresponding PM probe. Affymetrix's strategy is to make MM probes identical to their PM counterpart except that the 13-th base is exchanged with its complement. The identification of genes that are differentially expressed \index{differential expression} in two populations is a popular application of Affymetrix GeneChip technology. Due to the cost of this technology, experiments using a small number of arrays are common. A situation we often see is the case where three arrays are used for each population. In this lab, we give an example of how to quickly create lists of genes that are interesting in the sense that they appear to be differentially expressed, starting from the raw probe level data (CEL files). \index{CEL file} %%See Labs~\ref{anal:diffexpr} and \ref{case:limma} for more on the analysis of differential expression. In Section~\ref{import}, we briefly describe the functions necessary to import the data into Bioconductor. In Section~\ref{rma} we talk about preprocessing. In Section~\ref{tests}, we describe ways to rank genes and decide on a cutoff. Finally, in Section~\ref{annotation} we describe how to make annotated reports and examine the PubMed literature related to the genes in our list. \section{Reading CEL files} \label{import} The \Rpackage{affy} package is needed to import the data into Bioconductor. %%FIXME: so why hide it? %%i was hidnig because of the long msg but lets see how it looks... <>= library("affy") @ In the Affymetrix system, the raw image data are stored in so-called DAT files. Currently, Bioconductor software does not handle the image data, and the starting point for most analyses are the CEL files. These are the result of processing the DAT files using Affymetrix software to produce estimated probe intensity values. Each CEL file contains data on the intensity at each probe on the GeneChip, as well as some other quantities. To import CEL file data into the Bioconductor system, we use the \Rfunction{ReadAffy} function. A typical invocation is: <>= library("affy") Data <- ReadAffy() @ which will attempt to read all the CEL files in the current working directory and place the probe-level data in an \Robject{AffyBatch}. If only a portion of the CEL files are to be read, one can use the \Rfunarg{filenames} argument to select those files by name, or set the \Rfunarg{widget} to \Robject{TRUE} to use a graphical user interface to select files. In this lab, we will not read in the data. Instead we use an already created \Robject{AffyBatch} object, named \Robject{spikein95}, available through the package \Rpackage{SpikeInSubset}. <>= library("SpikeInSubset") data(spikein95) @ \index{phenoData} An \Robject{AffyBatch} contains several \Robject{slots} that hold various data: <<>>= slotNames(spikein95) @ The \Robject{cdfName} slot contains the name of the Chip Description File, which maps the probes to their physical location on the chip, the \Robject{nrow} and \Robject{ncol} slots which give the dimensions of the chip, the \Robject{exprs} slot which contains the probe level data (the \Robject{se.exprs} is not used at this time), the \Robject{phenoData} slot which contains phenotypic data that describe each sample, the \Robject{annotation} slot which contains the name of the annotation package for the chip, the \Robject{description} slot which contains the 'MIAME' information for the experiment (see ?MIAME for more information), and the \Robject{notes} slot, which contains any notes about the experiment. The \Robject{spikein95} \Robject{Affybatch} contains data from a six array subset (2 sets of triplicates) from a calibration experiment performed by Affymetrix. For the purpose of this experiment, we replace the current \Robject{phenoData}, describing the calibration experiment, with information needed for our example dealing with two populations. <<>>= pd <- data.frame(population=c(1,1,1,2,2,2), replicate=c(1,2,3,1,2,3)) rownames(pd) <- sampleNames(spikein95) vl <- list(population="1 is control, 2 is treatment", replicate="arbitrary numbering") phenoData(spikein95) <- new("phenoData",pData=pd,varLabels=vl) @ % The assignment to \Robject{phenoData} above, can also be done using the function \Rfunction{read.phenoData}. There are various \emph{accessor} functions available for extracting data from an \Robject{AffyBatch}. Two notable examples are the \Rfunction{pm} and \Rfunction{mm} functions which are used to extract the PM and MM probe intensities. One may also pass a vector of Affymetrix probe IDs as the second argument to these functions to return intensities for only certain probesets. More information about the \Robject{AffyBatch} class can be found in the help page (accessed by typing class ?Affybatch at an R prompt). \subsection{Quality Control} Obtaining gene expression measures for biological samples from Affymetrix GeneChip microarrays is an elaborate process with many potential sources of variation. In addition, the process is both costly and time consuming. Therefore, it is critical to make the best use of the information produced by the arrays, and to ascertain the quality of this information. Unfortunately, data quality assessment is complicated by the sheer volume of data involved and by the many processing steps required to produce the expression measures for each array. Furthermore, quality is a term which has some dependency on context to which it is applied. From the viewpoint of a user of GeneChip expression values, lower variability data, with all other things being equal, should be judged to be of higher quality. There are several Bioconductor packages designed for the analysis of Affymetrix GeneChip data, each of which has a set of functions that can be used to assess overall quality of the raw data. The \Rpackage{affy} package supplies several functions for exploratory data analysis (EDA), which is often the most powerful way to detect either outlier or high-variability chips. <>= library("ALLMLL") data(MLL.B) Data <- MLL.B[,c(2,1,3:5,14,6,13)] sampleNames(Data) <- letters[1:8] rm(MLL.B) gc() @ \begin{Ex} The \Rpackage{ALLMLL} contains a subset of data from a large acute lymphoblastic leukemia (ALL) study \citep{ross:etal:2004}. The above code loaded the package and subsetted the resulting \Robject{AffyBatch} to contain just eight samples. \begin{itemize} \item Using the \Rfunction{image} function, plot a false color image of the first chip. Notice the obvious spatial artifact. \item Using the \Rfunction{hist} function, plot smoothed densities for each chip. Note the bimodal distribution of the first chip - this is due to the spatial artifact noted above. \item Look at a boxplot of the chips - the high background of the 'f' chip is obvious in this plot. \end{itemize} \end{Ex} %% Maybe have them look at MA plots too? \section{Preprocessing} \label{rma} \index{preprocessing} After checking overall chip quality, the next step is to preprocess the data to obtain expression measures for each gene on each array. Preprocessing Affymetrix expression arrays usually involves three steps: background adjustment, normalization, and summarization. There are several methods available from Bioconductor for preprocessing Affymetrix GeneChip data. These methods include the \Rfunction{mas5} function, which very closely duplicates the MAS5 method developed by Affymetrix \citep{Affymetrix:ExpressionAnalysisTechnicalManual:2003}, the \Rfunction{fit.li.wong} function, which emulates the MBEI method \citep{LiWongPNAS2001}, the \Rfunction{rma} function \citep{Irizarry2003}, and \Rfunction{gcrma} \citep{Wu:etal:2004}. In addition, the \Rfunction{expresso} function can be used to combine various background adjustment, normalization, and summarization methods. Here we simply use \Rfunction{rma} to compute expression measures. %%FIXME: don't hide things if you describe them %%Fixed.. but i worry about the messages looking ugly.... put it back %%to where it was, becuase nothing is learned from the msg. <>= eset <- rma(spikein95) @ The \Rfunction{rma} function will background correct, normalize, and summarize the probe level data, as described by \cite{Irizarry2003}, into expression level data. The expression values are in log base 2 scale. The information is saved in an instance of the \Robject{exprSet} class, which contains similar information to the \Robject{AffyBatch} class. Specifically, the \Robject{exprSet} contains expression level as opposed to probe level data. %%FIXME: except....; what is different %%Fixed A matrix with the expression information is readily available. The following code extracts the expression data and demonstrates the dimensions of the matrix storing the data: <<>>= e <- exprs(eset) dim(e) @ We can see that in this matrix, rows represent genes and columns represent arrays. To know which columns go with what population, we can rely on the \Robject{phenoData} information inherited from \Robject{spikein95}. <<>>= pData(eset) @ We can conveniently use \$ to access each column and create indexes denoting which columns represent each population: <<>>= Index1 <- which(eset$population==1) Index2 <- which(eset$population==2) @ We will use this information in the next section. As mentioned above, we are only demonstrating the use of RMA. Other options are available through the functions \Rfunction{mas5} and \Rfunction{expresso}. If you find \Rfunction{expresso} too slow, the package \Rpackage{affyPLM} provides an alternative, \Robject{threestep}, that is faster owing to use of C code. \begin{Ex} \begin{itemize}\item Compute expression values for these data using the mas5 function. Compare the first replicate from the two populations using an 'M vs A' plot (see ?mva.plot) Hint: you will need to subset out the two chips you want to compare. How does this compare to an M vs A plot for the same two samples using the rma expression measure? \item The rma function fits a robust model to the data. As with all models, it is useful to look at the residuals to see how well the model fits the data. Using the affyPLM package, compute rma expression measures and then plot residuals for each chip. Hint: see ?rmaPLM and the ``Fitting Probe Level Models'' vignette in the affyPLM package.\end{itemize}\end{Ex} \section{Ranking and filtering genes} \label{tests} \index{ranking} Now we have, in \Robject{e}, a measurement $x_{ijk}$ of log (base 2) expression from each gene $j$ on each array $i$ for both populations $k=1,2$. For ranking purposes, it is convenient to quantify the average level of differential expression for each gene. A naive first choice is to simply consider the average log fold-change: \[ d_j= \bar{x}_{j2}-\bar{x}_{j1} \] with % FIXME2005 watch for leaks on right \[ \ \bar{x}_{j2}=(x_{1j2}+x_{2j2}+x_{3j2})/3 \mbox{ and } \bar{x}_{j1}=(x_{1j1}+x_{2j1}+x_{3j1})/3, \mbox{ for }j=1,\dots,J. \] To obtain $d$, we can use the \Rfunction{rowMeans} function, which provides a much faster alternative to the commonly used function \Rfunction{apply}: <>= d <- rowMeans(e[,Index2])-rowMeans(e[,Index1]) @ Various authors have noticed that the variability of fold-change measurements usually depends on over-all expression of the gene in question. This means that the evidence that large observed values of $d$ provide, should be judged by conditioning on some value representing over-all expression. An example is average log expression: <<>>= a <- rowMeans(e) @ The MA-plot~\ref{volcano}A shows $d$ plotted against $a$. We restrict the $y$-axis to log fold changes of less than 1 in absolute value (fold change smaller than 2). We do this because in this analysis only one gene reached a fold change larger than 2. To see this we can use the following line of code: % <>= sum(abs(d)>1) @ % <>= plot(a,d,ylim=c(-1,1),main="A) MA-plot",pch=".") @ %------------------------------------------------------------ \subsection{Summary statistics and tests for ranking} %------------------------------------------------------------ In Figure~\ref{volcano}A we can see how variance decreases with the value of \Robject{a}. Also, different genes may have different levels of variation. Should we then consider a ranking procedure that takes variance into account? A popular choice is the $t$-statistic. \index{$t$-statistic} The $t$-statistic is a ratio comparing the effect size estimate $d$ to a sample based within group estimate of the standard error: \[ s_j^2= \frac{1}{2}\sum_{i=1}^3 (x_{ij2}-\bar{x}_{j2})^2/3 + \frac{1}{2}\sum_{i=1}^3 (x_{ij1}-\bar{x}_{j1})^2/3, \mbox{ for }j=1,\dots,J. \] To calculate the $t$-statistic, $d_j/s_j$, we can use the function \Rfunction{rowttests} from the package \Rpackage{genefilter}, as demonstrated in the following code: <>= library("genefilter") tt <- rowttests(e, factor(eset$population)) @ % The first argument is the data matrix, the second indicates the two groups being compared. Do our rankings change much if we use the $t$-statistic instead of average log fold change? The volcano plot \index{volcano plot} is a useful way to see both these quantities simultaneously. This figure plots $p$-values (more specifically $-\log_{10}$ the $p$-value) versus effect size. For simplicity we assume the $t$-statistic follows a t-distribution. To create a volcano plot, seen in Figure~\ref{volcano}B, we can use the following simple code: <>= lod <- -log10(tt$p.value) plot(d,lod,cex=.25,main="B) Volcano plot for $t$-test") abline(h=2) @ This Figure demonstrates that the $t$-test and the average log fold change give us different answers. \begin{figure}[H] \begin{center} \begin{tabular}{cc} \includegraphics[width=0.49\textwidth]{affylab-maplot}& \includegraphics[width=0.49\textwidth]{affylab-volcano}\\ \includegraphics[width=0.49\textwidth]{affylab-volcano2}& \includegraphics[width=0.49\textwidth]{affylab-volcano3} \end{tabular} \end{center} \caption{A) MA-plot. B) Volcano plot for the $t$-test. C) As B) but with restricted $x$-axis, blue diamonds denoting the top 25 genes ranked by average fold change and red circles to denoting the top 25 genes ranked by smallest $p$-value. D) As C) but for the moderated $t$-test.} \label{volcano} \end{figure} Figure~\ref{volcano}C is like Figure~\ref{volcano}B but we restrict the $x$-axis to $[-1,1]$ to get a better view of most of the data. We also add blue diamonds to denote the top 25 genes ranked by average fold change and red circles to denote the top 25 genes ranked by smallest $p$-value. The following lines of code permit us to create this figure: <>= lod <- -log10(tt$p.value) plot(d,lod,cex=.25,main="B) Volcano plot for $t$-test") abline(h=2) @ %\begin{figure}[H] %\begin{center} %\includegraphics[width=0.49\textwidth]{fromcels-volcano2} %\end{center} %\caption{\label{volcano2}Volcano plot with restricted $x$-axis. Blue % diamonds denote the top 25 genes ranked by average fold change and % red circles to denote the top 25 genes ranked by smallest $p$-value.} %\end{figure} There is a relatively large disparity. Possible explanations are 1) Some genes have larger variance than others. Large variance genes that are not differentially expressed have a higher chance of having large log fold changes. Because the $t$-statistics take variance into account, these do not have small $p$-values. 2) With only three measurements per group the estimate of the standard error of the effect size is not stable and some genes have small $p$-values only because, by chance, the denominator of the $t$-statistic was very small. Figure \ref{volcano}C demonstrates both of these explanations are possible. Do we use average fold change or the $t$-statistic to rank? They both appear to have strengths and weaknesses. As mentioned, a problem with the $t$-statistic is that there is not enough data to estimate variances. Many researchers have proposed alternative statistics that borrow strength across all genes to obtain a more stable estimate of gene-specific variance. The resulting statistics are referred to as {\em modified} $t$-statistics. Because they reduce the possibility of large values they are also referred to as {\em penalized}, {\em attenuated}, or {\em regularized} $t$-statistics. The are many versions of modified $t$-statistics, for example the statistic used by SAM \index{SAM} \citep{tusher:2001}, and it is impossible to describe them all in this lab. An example of such a modified $t$-statistic is given by \citet{smyth:2004} and is available in the \Rpackage{limma} package. It is denoted the \index{moderated $t$-statistic} {\em moderated} $t$-statistic. This statistic is based on an empirical Bayes approach, described in detail by the above mentioned references, can be implemented with the following code: <<>>= library("limma") design <- model.matrix(~factor(eset$population)) fit <- lmFit(eset, design) ebayes <- eBayes(fit) @ For more details on the above code please see the \Rpackage{limma} user's manual. One detail we will point out is that, as part of its output, the \Rfunction{lmFit} function produces the values we have stored in \Robject{d} and \Robject{tstat} previously computed. Figure \ref{volcano}D demonstrates the empirical Bayes approach improves on the $t$-statistic in terms of not giving high rank to genes only because they have small sample variances. To see this notice that we now have fewer genes with small $p$-values (large in the $y$-axis) that have very small log fold changes (values close to 0 in the $x$-axis). %%FIXME:how? %%Fixed. <>= lod <- -log10(ebayes$p.value[,2]) mtstat<- ebayes$t[,2] o1 <- order(abs(d),decreasing=TRUE)[1:25] o2 <- order(abs(mtstat),decreasing=TRUE)[1:25] o <- union(o1,o2) plot(d[-o],lod[-o],cex=.25,xlim=c(-1,1),ylim=c(0,4), main="D) Volcano plot for moderated $t$-test") points(d[o1],lod[o1],pch=18,col="blue") points(d[o2],lod[o2],pch=1,col="red") @ \subsection{Selecting cutoffs} %%FIXME - refer to multtest lab %%FIXED We have presented three statistics that can be used to rank genes. Now we turn our attention to deciding on a cutoff. A naive approach is to consider genes attaining $p$-values less than 0.01. However, because we are testing multiple hypotheses, the $p$-values no longer have the typical meaning. Notice that if all the 12626 (number of genes on array) null hypotheses are true (no genes are differentially expressed) we expect $0.01 \times 12626 = 126.26$ false positives. In our data set we have <<>>= sum(tt$p.value<=0.01) @ Various approaches have been suggested for dealing with the multiple hypothesis problem. \index{multiple testing} The suggested procedures usually provide adjusted $p$-values that can be used to decide on appropriate cutoffs. However, different procedures result in different interpretations of the resulting lists. In Section~\ref{annotation}, we demonstrate simple code that can be used, along with the results of the \Rfunction{eBayes} function described above, to construct lists of genes based on adjusted $p$-values. An alternative \textit{ad hoc} procedure is to use the MA and volcano plots to look for groups of genes that appear to be departing from the majority. For example, in Figure \ref{volcano}C we can restrict our attention to the small cluster of genes in the upper right corner having small $p$-value and large fold changes. \subsection{Comparison} We deviate slightly from our analysis to demonstrate that the moderated $t$-statistic appears to perform better than average log fold change and the $t$-test. Because we are using a spike-in experiment we can assess the three competing statistics. First we obtain the names of the genes that were spiked-in. This is available in the original \Robject{phenoData}. <<>>= data(spikein95) spikedin <- colnames(pData(spikein95)) spikedIndex <- match(spikedin,geneNames(eset)) @ Because in this experiment replicate RNA, except for the spiked-in material, was hybridized to all arrays, only the 16 spiked-in genes should be differentially expressed. This means that a perfect ranking procedure would assign ranks from 1 to 16 to the 16 spiked-in genes. The following code shows that the moderated $t$-statistic is performing better than its three competitors. <<>>= d.rank <- sort(rank(-abs(d))[spikedIndex]) t.rank <- sort(rank(-abs(tt$statistic))[spikedIndex]) mt.rank <- sort(rank(-abs(mtstat))[spikedIndex]) ranks <- cbind(mt.rank,d.rank,t.rank) rownames(ranks) <- NULL ranks @ The results above permit us to see how many true positives we would have in a list of any size ranked by each of the three statistics. To do this, we simply count the number of spiked-in genes with ranks smaller or equal to the size of the list. As an example notice that in a list of the top 25 genes, we would obtain 12, 11, and 7 true positives for the moderated $t$-statistic. This demonstrates that the choice of ranking statistic can have an effect on your final results. Also notice that this is just one analysis, and one would expect different results with a new data set. We encourage users to run this analysis on other data sets and assess the performance of different alternative procedures. %%FIXME: and these are estimates with standard errors, and I am not %% convinced on one run that you have the order right :-( %%Fixed. \section{Annotation} \label{annotation} \index{annotation} Let us now consider a list of interesting genes and create a report. As an example we will consider the top 10 genes as ranked by the moderated $t$-statistic. To do this we can use the \Rfunction{topTable} function from the \Rpackage{limma} package in the following way: <<>>= tab <- topTable(ebayes,coef=2,adjust="fdr",n=10) @ This function creates a table showing various interesting statistics. This line of code shows us the first 5 entries: <<>>= tab[1:5,] @ By specifying \Rfunarg{coef} as $2$ we define $d$ as our parameter of interest. The argument \Rfunarg{n} defines how many top genes to include in the table. Finally, through \Rfunarg{adjust} we are instructing \Rfunction{topTable} to calculate the false discovery rate adjustment for the $p$-values obtained from \Rfunction{eBayes} using the procedure described by \citet{benjamini:1995}. See the limma user's manual for more details. In the remainder of this lab, we will use the Affymetrix identifiers to obtain biological meta-data information about these 10 genes. We can obtain these using the \Robject{ID} column of \Robject{tab}: <<>>= genenames <- as.character(tab$ID) @ \subsection{Annotation package overview} First, we will digress slightly to explain the various metaData packages that are available from Bioconductor. The metaData packages can be divided into three groups: \begin{itemize} \item General annotation (GO, KEGG) \item Species level annotation (xxxhomology, YEAST) \item Chip level annotation (hgu133plus2, rat4302, etc.) \end{itemize} Each of these metaData packages contain one or more \emph{environments} which are similar to hash tables. A hash table provides a very quick way of looking up \emph{values} using a \emph{key}. We can use the functions \Rfunction{get}, and \Rfunction{mget} to extract data from a given environment. As an example, we will use the \Rpackage{hgu95av2} package. Note that this package contains several environments, most of which use the Affymetrix probe IDs as keys. <<>>= library("hgu95av2") hgu95av2() gn <- mget(genenames, hgu95av2GENENAME) sapply(gn, function(x) if(!is.na(x)) strwrap(x) else x) @ One problem that may occur when using an environment for the first time is determining what the keys are for that particular environment. The keys can be extracted using the \Rfunction{ls} function: <<>>= ls(hgu95av2GENENAME)[1:10] @ The metaData packages are generally used in a step-wise manner, starting at the most specific (chip-level), and proceeding to the most general. For instance, we could take a set of Affymetrix probe IDs and map them to their respective Gene Ontology (GO) terms using the \Robject{hgu95av2GO} environment. This will give us a list containing all the GO IDs associated with each of the Affymetrix probe IDs. We could then use these GO IDs (which are keys for the \Rpackage{GO} package) to map the Affymetrix probe IDs to the GO terms that are associated with each of the Affymetrix probe IDs. <<>>= library("GO") library("annotate") gos <- mget(genenames, hgu95av2GO) gos[[1]][1:2] goids <- sapply(gos, names) goids[[1]] goterms <- sapply(goids, function(x) mget(as.character(x), GOTERM)) sapply(goterms[[1]], Term) @ \begin{Ex} We have shown how to extract GO terms, starting with a set of Affymetrix identifiers. The \Robject{hgu95av2PATH} environment maps Affymetrix IDs to the Kyoto Encyclopedia of Genes and Geneomes (KEGG), which maintains pathway information for various organisms. As can be seen above, only about 25\% of the Affymetrix IDs actually map to a pathway -- the remaining key values simply return an \Robject{NA}. Extract the first five KEGG pathway IDs from \Robject{hgu95av2PATH} and using the \Rpackage{KEGG} package, map them to the respective pathway name. See the help page for \Robject{hgu95av2PATH} for a hint. \end{Ex} This section showed how to extract data from environments 'by hand'. Although this sort of thing can be tedious, it is sometimes necessary, so it is worth knowing how it is done. However, the \Rpackage{annotate} and \Rpackage{annaffy} packages supply powerful tools for automated extraction of annotation information that can be used in most situations. In the following sections, we present some simple examples. \subsection{PubMed abstracts} %%FIXME: appropriate references to other parts of the monograph where they %% can learn more %%Fixed The \Rpackage{annotate} package contains many functions that can be used to query the PubMed \index{PubMed} database and format the results. Here we show how some of these can be used to obtain abstracts with references to a list of genes and report findings in the form of a Web page. Specifically, to get the abstract information we can use the \Robject{pm.getabst} in the \Rpackage{annotate} package (which requires the \Robject{XML} package): <>= library("XML") absts<-pm.getabst(genenames,"hgu95av2") @ The return value \Robject{absts} is a list of the same length as the \Robject{genenames} argument. Each of the list's components corresponds to a gene and is itself a list. Each component of this list is an abstract for that gene. To see the fourth abstract for the first gene, we type the following: <>= absts[[1]][[4]] @ Please be careful using the function \Robject{pm.getabst} because it queries PubMed directly. Too many queries of PubMed can get you banned. To only see the titles for, say, the second probeset, we can type the following. <>= ## wh 13.01.05: commented out pm.titles since it behaves weird. ## Revert back to it when it is fixed? ##titl <- pm.titles(absts[2]) titl <- sapply(absts[[2]], articleTitle) strwrap(titl, simplify=FALSE) @ Here we used the \Rfunction{strwrap} function to format the text to fit the page width. Having these abstracts as R objects is useful because, for example, we can search for words in the abstract automatically. The code below searches for the word ``protein'' in all the abstracts and returns a logical value: <<>>= pro.res <- sapply(absts, function(x) pm.abstGrep("[Pp]rotein", x)) pro.res[[2]] @ %%FIXME: annaffy, etc references %%Fixed: see above A convenient way to view the abstracts is to create a Web page: <<>>= pmAbst2HTML(absts[[2]],filename="pm.html") @ The above code creates an HTML file with abstracts for the second probeset in \Robject{genenames}. \subsection{Generating reports} The object \Robject{top} gives us valuable information about the top 10 genes. However, it is convenient to create a more informative table. We can easily add annotation to this table and create an HTML document with hyperlinks to sites with more information about the genes. The following code obtains the Entrez Gene ID, UniGene ID, and symbol for each of our selected genes. <<>>= ll <- getLL(genenames,"hgu95av2") ug <- unlist(lookUp(genenames, "hgu95av2", "UNIGENE")) sym <- getSYMBOL(genenames,"hgu95av2") @ With these values available, we can use the following code to create an HTML page useful for sharing results with collaborators. %%FIXME: please do not use deprecated functions %%Fixed <>= tab <- data.frame(sym,tab[,-1]) htmlpage(list(ll, ug),filename="report.html",title="HTML report",othernames=tab, table.head=c("Locus ID", "UniGene ID", colnames(tab)),table.center=TRUE, repository=list("ll","ug")) @ This HTML report contains links to both Entrez Gene and UniGene. However, \Rfunction{htmlpage} can only make links to a limited number of online databases, and is primarily intended for microarrays that do not have chip-level annotation packages. For those microarrays that do have annotation packages, we can use the \Rpackage{annaffy} package. Below are a few lines of code that create a useful HTML report \index{HTML report} with links to various annotation sites. <>= library("annaffy") atab <- aafTableAnn( genenames, "hgu95av2", aaf.handler() ) saveHTML(atab, file="report2.html") @ \begin{Ex} The \Rpackage{annaffy} package can be used to create HTML (and text) tables that contain not only links to various online databases, but that contain additional information as well. It useful to append the expression values and statistics for the genes that are being annotated, which gives you and your collaborators the ability to perform 'sanity checks' to ensure that no coding errors have been made in selecting a set of genes. Reproduce the above HTML table, but include the t-statistic, p-value, and expression values for each gene. Also limit the number of links to only include the Affymetrix ID, UniGene, Entrez Gene, and PubMed. \end{Ex} \subsection{Statistics based on Annotation} Given a set of differentially regulated genes, it is fairly common to ask if certain pathways are being perturbed in some way. This is a difficult question to answer given the fact that most genes (75\% of the genes on the HG-U95Av2 chip, for instance) are not mapped to any pathway. As a surrogate, we can look for GO terms that appear to be 'concentrated' in a set of differentially expressed genes using the \Rfunction{GOHyperG} function in the \Rpackage{GOstats} package. If we find that the number of genes with a certain GO term are overrepresented in the set of differentially expressed genes, then this may imply that a particular pathway or function is being affected. <>= library("GOstats") @ <<>>= index <- sample(1:12000, 100) gn <- ls(hgu95av2GENENAME)[index] lls <- unique(getLL(gn, "hgu95av2")) hyp <- GOHyperG(lls, "hgu95av2", what = "MF") index2 <- hyp$pvalues < 0.05 wh <- mget(names(hyp$pvalues)[index2], GOTERM) terms <- sapply(wh, Term) strwrap(terms[1:10], simplify = FALSE) hyp$pvalues[index2][1:10] @ \begin{Ex} The above code is an example of something that could easily become quite tedious if it had to be repeated more than one or two times. It is often worth the time and effort to write a small 'wrapper' function that will perform all the required steps to take e.g., a vector of Affymetrix probe IDs and output a \Robject{data.frame} containing the GO ID, GO term, p-value, the number of genes with that GO term in the set of significant genes, and the number on the chip. The last two bits of information are good to know because it is easy to get a very small p-value when there are few genes with a given GO term on the chip (e.g., if there are three genes with a GO term XXX, and one is in the list of significant genes, you will get a small p-value, but it is probably not an interesting result). If we have time (and you feel up to it), write a wrapper function to do this, outputting the results as a \Robject{data.frame}. \end{Ex} \section{Conclusion} %%inconsistent reference to cel files :-( %%Fixed We have demonstrated how one can use Bioconductor tools to go from the raw data in CEL files to annotated reports of a select group of interesting genes. We presented only one way of doing this. There are many alternatives also available via Bioconductor. We hope that this lab serves as an example of a useful analysis and also of the flexibility of Bioconductor. \bibliographystyle{plainnat} \bibliography{affylab.bib} \end{document}