% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{HowTo: Get HTML Output} % \VignetteDepends{annotate, hgu95av2.db} % \VignetteKeywords{Expression Analysis, Annotation} % \VignettePackage{annotate} \documentclass[11pt]{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \usepackage[authoryear,round]{natbib} \bibliographystyle{plainnat} \usepackage{hyperref} \begin{document} \title{HowTo: get pretty HTML output for my gene list} \author{James W. MacDonald} \maketitle{} \section{Overview} The intent of this vignette is to show how to make reasonably nice looking HTML tables for presenting the results of a microarray analysis. These tables are a very nice format because you can insert clickable links to various public annotation databases, which facilitates the downstream analysis. In addition, the format is quite compact, can be posted on the web, and can be viewed using any number of free web browsers. One caveat; an HTML table is probably not the best format for presenting the results for \emph{all} of the genes on a chip. For even a small (5000 gene) chip, the file could be 10 Mb or more, which would take an inordinate amount of time to open and view. Also note that the Bioconductor project supplies annotation packages for many of the more popular Affymetrix chips, as well as for many commercial spotted cDNA chips. For chips that have annotation packages, the \Rpackage{annaffy} package is the preferred method for making HTML tables. To make an annotated HTML table, the only requirement is that we have some sort of annotation data for the microarray that we are using. Most manufacturers supply data in various formats that can be read into \Rpackage{R}. For instance, Affymetrix supplies CSV files that can be read into \Rpackage{R} using the \Rmethod{read.csv()} function \url{http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays}. \section{Alternate methods} Please note that one can also make these HTML tables by parsing data from e.g., an online (or local) Biomart database, using functions in the biomaRt package. This may be easier, and may result in more current annotation data. Please see the prettyOutput vignette in the biomaRt package for more information. \section{Data Analysis} I will assume that the reader is familiar with the analysis of microarray data, and has a set of genes that she would like to use. In addition, I will assume that the reader is familiar enough with \Rpackage{R} that she can subset the data based on a list of genes, and reorder based on a particular statistic. For any questions about subsetting or ordering data, please see ``An Introduction to R''. For questions regarding microarray analysis, please consult the vignettes for, say \Rpackage{limma}, \Rpackage{multtest}, or \Rpackage{marray}. \section{Getting Started} We first load the \Rpackage{annotate} package, as well as some data. These data will be from the Affymetrix HG-U95Av2 chip (for which we would normally use \Rpackage{annaffy}). To keep the HTML table small, we will take a subset of fifteen genes as an example. <>= options(width=70) @ <<>>= library("annotate") data(sample.ExpressionSet) igenes <- featureNames(sample.ExpressionSet)[246:260] @ \section{Annotation Data} <>= ug <- c("Hs.169284 // ---", "Hs.268515 // full length", "Hs.103419 // full length", "Hs.380429 // ---" ,"--- // ---", "Hs.169331 // full length", "Hs.381231 // full length", "Hs.283781 // full length", "--- // ---", "--- // ---", "Hs.3195 // full length", "--- // ---", "Hs.176660 // full length", "Hs.272484 // full length", "Hs.372679 // full length") ll <- c("221823", "4330", "9637", "---", "---", "6331", "841", "27335", "---", "---", "6375", "---", "2543", "2578", "2215") gb <- c("M57423", "Z70218", "L17328", "S81916", "U63332", "M77235", "X98175", "AB019392", "J03071", "D25272", "D63789", "D63789", "U19142", "U19147", "X16863") sp <- c("P21108", "Q10571", "Q9UHY8", "Q16444", "---", "Q14524 /// Q8IZC9 /// Q8WTQ6 /// Q8WWN5 /// Q96J69", "Q14790", "Q9UBQ5", "---", "---", "P47992", "---", "Q13065 /// Q8IYC5", "Q13070", "O75015") @ For this vignette I have supplied the annotation data. In a normal situation, these data would be subset from the manufacturer's annotation data, using the manufacturer's gene identifiers (which is how I got these IDs). First, we will look at the GenBank and LocusLink IDs. We will be able to use these IDs without further modification. Note that the LocusLink IDs contain some missing data (``---''). This will not pose a problem because LocusLink IDs are all numeric, so we have incorporated code in \Rmethod{htmlpage()} to automatically convert any non-numeric ID to an HTML empty cell character (``\ ''). GenBank IDs (which often correspond to either RefSeq or GenBank IDs) are not as consistent, so any missing data would have to be manually converted to the HTML empty cell character. Missing data for LocusLink, UniGene and OMIM IDs are automatically converted, whereas Affymetrix, SwissProt and GenBank IDs have to be done manually. I will give examples of how to do this below. <<>>= gb ll @ The UniGene and SwissProt IDs present different challenges, so we will modify them separately. For the UniGene IDs we need to strip off the extra information appended to each ID. If we didn't do this, our hyperlink would not work correctly. <<>>= ug ug <- sub(" //.*$", "", ug) ug @ The SwissProt IDs present a different challenge. Here there isn't any extra information. Instead, we have multiple IDs for some of the genes, and missing IDs for some of the others. Because the code for SwissProt IDs will not automatically handle missing data, we have to convert the missing data to an HTML empty cell identifier (``\ ''). For \Rmethod{htmlpage()} to correctly handle multiple IDs, we have to convert the character vector into a \emph{list} of character vectors. <<>>= sp sp <- strsplit(sub("---"," ",as.character(sp)), "///") sp @ We have converted the data to a list of character vectors, and also converted the ``---'' missing data identifier to the HTML character for an empty cell. \section{Build the Table} Usually we would like to include the expression values for our genes along with some statistics, say a $t$-statistic, fold change, and $p$-value. As an example, we will make a comparison using the first ten samples. <>= dat <- exprs(sample.ExpressionSet)[igenes,1:10] FC <- rowMeans(dat[igenes,1:5]) - rowMeans(dat[igenes,6:10]) pval <- esApply(sample.ExpressionSet[igenes,1:10], 1, function(x) t.test(x[1:5], x[6:10])$p.value) tstat <- esApply(sample.ExpressionSet[igenes,1:10], 1, function(x) t.test(x[1:5], x[6:10])$statistic) @ It is also usually a good idea to include gene names in the table. Normally the names would be subsetted from the annotation data, but here we have to supply them. Again, we have to manually convert any missing names to the HTML empty cell character. <>= name <- c("hypothetical protein LOC221823", "meningioma (disrupted in balanced translocation) 1", "fasciculation and elongation protein zeta 2 (zygin II)", "Phosphoglycerate kinase {alternatively spliced}", "---","sodium channel, voltage-gated, type V, alpha polypeptide", "caspase 8, apoptosis-related cysteine protease","muscle specific gene","---","---","chemokine (C motif) ligand 1", "---","G antigen 1","G antigen 6","Fc fragment of IgG, low affinity IIIb, receptor for (CD16)") @ <<>>= name name <- gsub("---", " ", name) name @ We can now build our HTML table. To make the process more transparent, this will be done in steps. In practice however, this can be done in one line. Note here that the genelist consists of annotation data that will be hyperlinked to online databases, whereas othernames consists of other data that will not be hyperlinked. <>= genelist <- list(igenes, ug, ll, gb, sp) filename <- "Interesting_genes.html" title <- "An Artificial Set of Interesting Genes" othernames <- list(name, round(tstat, 2), round(pval, 3), round(FC, 1), round(dat, 2)) head <- c("Probe ID", "UniGene", "LocusLink", "GenBank", "SwissProt", "Gene Name", "t-statistic", "p-value", "Fold Change", "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6", "Sample 7", "Sample 8", "Sample 9", "Sample 10") repository <- list("affy", "ug", "en", "gb", "sp") htmlpage(genelist, filename, title, othernames, head, repository = repository) @ \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \end{document}