% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{affycomp primer} %\VignetteKeywords{Preprocessing, Affymetrix} %\VignetteDepends{Biobase} %\VignetteDepends{affycompData} %\VignettePackage{affycomp} %documentclass[12pt, a4paper]{article} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \author{Rafael Irizarry and Leslie Cope} \begin{document} \title{Bioconductor Expression Assessment Tool for Affymetrix Oligonucleotide Arrays (affycomp)} \maketitle \tableofcontents \section{Introduction} <>= library(affycomp) library(affycompData) @ The defining feature of oligonucleotide expression arrays is the use of several probes to assay each targeted transcript. This is a bonanza for the statistical geneticist, offering a great opportunity to create probeset summaries with specific characteristics. There are now several methods available for summarizing probe level data from the popular Affymetrix GeneChips. It is harder to identify the method best suited to a given inquiry. This package provides a {\it graphical tool} for summaries of Affymetrix probe level data. It is the engine behind our webtool \url{http://affycomp.jhsph.edu/}. Plots and summary statistics offer a picture of how an expression measure performs in several important areas. This picture facilitates the comparison of competing expression measures and the selection of methods suitable for a specific investigation. The key is a benchmark dataset consisting of a dilution study and a spike-in study. Because the {\it truth} is known for this data, it is possible to identify statistical features of the data for which the expected outcome is known in advance. Those features highlighted in our suite of graphs are justified by questions of biological interest, and motivated by the presence of appropriate data. \subsection{Benchmark dataset} The spike-in benchmark data was originally free and easily available to the public, from Affymetrix. Copies in compressed-tar format may now be obtained here: \url{http://www.biostat.jhsph.edu/~ririzarr/affycomp/spikein.tgz} and \url{http://www.biostat.jhsph.edu/~ririzarr/affycomp/hgu133spikein.tgz}. The dilution benchmark data was originally also free, although not as easily available, from Gene Logic. For a full description of the benchmark data, see our {\it RMA} papers "Exploration, normalization, and summaries of high density oligonucleotide array probe level data", Biostatistics, 2003 Apr;4(2):249-64 (\url{http://www.ncbi.nlm.nih.gov/pubmed/12925520}) and "Summaries of Affymetrix GeneChip probe level data", Nucleic Acids Res, 2003 February 15; 31(4): e15 (\url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC150247/}). In the Gene Logic dilution study, (\url{http://www.genelogic.com/support/scientific-studies/}), two sources of cRNA, human liver tissue and central nervous system cell line (CNS), were hybridized to human arrays (HG-U95Av2) in a range of dilutions and proportions. You must contact Gene Logic ({info@genelogic.com} or +1-800-GENELOGIC) to obtain the dilution data. In the Affymetrix spike-in study, different cRNA fragments were added to the hybridization mixture of the arrays at different picoMolar concentrations. The cRNAs were spiked-in at a different concentration on each array (apart from replicates) arranged in a cyclic Latin square design with each concentration appearing once in each row and column. All arrays had a common background cRNA. \subsection{phenodata} The package includes phenoData objects that give more details, \verb+dilution.phenodata+ and \verb+spikein.phenodata+ (also, \verb+hgu133a.spikein.phenodata+). \section{What's new in version 1.2?} A new set of assessments is provided by the function \verb+assessSpikeIn2+, which is a wrapper for \verb+assessMA2+, \verb+assessSpikeInSD+, and \verb+assessLS+. It will accept a full \verb+ExpressionSet+ of the spikein data (59 columns), as with \verb+assessSpikeIn+, but in this case only uses columns 1:13, 17, 21:33, and 37. Otherwise, it assumes that it has expression measures only for those particular 28 arrays. All spike-in related assessments now support the HGU133A chip, in addition to the HGU95A chip handled by version 1.1. The function \verb+read.newspikein+, which is simply a call to \verb+read.spikein+ with {\tt cdfName = "hgu133a"}, will read HGU133A expression measures. \section{The Image Report} Given a file named \verb+dilfilename.csv+ containing your dilution expression measures and a file named \verb+sifilename.csv+ containing your spikein expression measures, you can easily obtain the graphs and summary statistics in an image report (illustrated using RMA): \begin{Sinput} R> library(affycomp) ##load the package R> d <- read.dilution("dilfilename.csv") R> s <- read.spikein("sifilename.csv") R> rma.assessment <- affycomp(d, s, method.name="RMA") \end{Sinput} \verb+affycomp+ is a wrapper for \verb+assessAll+ and \verb+affycompTable+. The return value will contain all the information, from \verb+assessAll+, necessary to recreate the graphs (below) without having to re-do the assessment. For example, the following two objects were created by \verb+assessAll+ on \verb+ExpressionSet+s created by {\it MAS 5.0} and RMA, respectively; they are lists of lists. <<>>= data(mas5.assessment) data(rma.assessment) @ <<>>= names(mas5.assessment) @ Each component is the result of a specific assessment. The names tell us what they are for. \verb+Dilution+ are the assessment based on the dilution data and can be used to create Figures 2, 3, and 4b. \verb+MA+ has the necessary information for the MA plot or Figure 1. \verb+Signal+ has the necessary information to create Figure 4a. \verb+FC+ has assessments related to fold change and can be used to create Figures 5a, 6a, and 6b. Finally, \verb+FC2+ has the necessary information to create Figure 5b. The captions for these Figures will give you an idea of what they are for. There are two kinds of plots, basic and comparative. The basic plots depict a given expression measure. In the comparative plots, the given expression measure is compared to other measures, MAS 5.0 by default. Tables are also automatically created with assessment statistics. Finally, a simple assessment of standard error estimates can be done. These are all described in the following subsections. \subsection{Basic plots} You can use \verb+affycompPlot+ which will automatically know what to do, or you can use the auxiliary figure functions that will need to have a specific assessment list. % HJ - 1/13/2011 % To control their placement, which was wrong, all figures are now % done in-line, each on its own forced newpage. The original figure % structure is left, commented out, in case someone can make it work. \newpage %\begin{figure}[htbp] \begin{center} <>= affycompPlot(mas5.assessment$MA) @ \end{center} {Figure 1: The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal $\log_2$ fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= affycomp.figure2(mas5.assessment$Dilution) @ \end{center} {Figure 2: For each gene, and each experimental condition, we calculate the mean log expression and the observed standard deviation across 5 replicates. The resulting scatterplot is smoothed to generate a single curve representing mean standard deviation as a function of mean log expression.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= affycomp.figure3(mas5.assessment$Dilution) @ \end{center} {Figure 3: This plot, using the Gene Logic dilution data, shows the sensitivity of fold change calculations to total RNA abundance. Average log fold-changes between liver and CNS for the lowest concentration and the highest in the dilution data set are computed. Orange and red color is used to denote genes with $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$ respectively. The rest are denoted with black.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure4a(mas5.assessment$Signal) affycomp.figure4b(mas5.assessment$Dilution) @ \end{center} {Figure 4: a) Average observed $log_2$ intensity plotted against nominal $log_2$ concentration for each spiked-in gene for all arrays in Affymetrix spike-In experiment. b) For the Gene Logic dilution data, log expression values are regressed against their log nominal concentration. The slope estimates are plotted against average log intensity across all concentrations.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure5a(mas5.assessment$FC) affycomp.figure5b(mas5.assessment$FC) @ \end{center} {Figure 5: A typical identification rule for differential expression filters genes with fold change exceeding a given threshold. This figure shows average ROC curves which offer a graphical representation of both specificity and sensitivity for such a detection rule. a) Average ROC curves based on comparisons with nominal fold changes ranging from 2 to 4096. b) As a) but with nominal fold changes equal to 2.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure6a(mas5.assessment$FC) affycomp.figure6b(mas5.assessment$FC) @ \end{center} {Figure 6: a) Observed log fold changes plotted against nominal log fold changes. The dashed lines represent highest, 25th highest, 100th highest, 25th percentile, 75th percentile, smallest 100th, smallest 25th, and smallest log fold change for the genes that were not differentially expressed. b) Like a) but the observed fold changes were calculated for spiked in genes with nominal concentrations no higher than 2pM.} %\end{figure} \subsection{Comparative plots} You can use \verb+affycompPlot+ which will automatically know what to do, or you can use the auxiliary figure functions that will need to have a specific assessment list. \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycompPlot(mas5.assessment$MA, rma.assessment$MA) @ \end{center} {Figure 1: The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal $\log_2$ fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= affycomp.compfig2(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 2: For each gene, and each experimental condition, we calculate the mean log expression and the observed standard deviation across 5 replicates. The resulting scatterplot is smoothed to generate a single curve representing mean standard deviation as a function of mean log expression.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= affycomp.compfig3(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 3: This plot, using the Gene Logic dilution data, shows the sensitivity of fold change calculations to total RNA abundance. Average log fold-changes between liver and CNS for the lowest concentration and the highest in the dilution data set are computed. Orange and red color is used to denote genes with $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$ respectively. The rest are denoted with black.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.compfig4a(list(mas5.assessment$Signal, rma.assessment$Signal), method.names=c("MAS 5.0","RMA")) affycomp.compfig4b(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 4: a) Average observed $log_2$ intensity plotted against nominal $log_2$ concentration for each spiked-in gene for all arrays in Affymetrix spike-In experiment. b) For the Gene Logic dilution data, log expression values are regressed against their log nominal concentration. The slope estimates are plotted against average log intensity across all concentrations.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.compfig5a(list(mas5.assessment$FC, rma.assessment$FC), method.names=c("MAS 5.0","RMA")) affycomp.compfig5b(list(mas5.assessment$FC2, rma.assessment$FC2), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 5: A typical identification rule for differential expression filters genes with fold change exceeding a given threshold. This figure shows average ROC curves which offer a graphical representation of both specificity and sensitivity for such a detection rule. a) Average ROC curves based on comparisons with nominal fold changes ranging from 2 to 4096. b) As a) but with nominal fold changes equal to 2.} %\end{figure} \newpage %\begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,2)) affycomp.figure6a(mas5.assessment$FC, main="a) MAS 5.0", ylim=c(-12,12)) affycomp.figure6a(rma.assessment$FC, main="a) RMA", ylim=c(-12,12)) affycomp.figure6b(mas5.assessment$FC, main="b) MAS 5.0", ylim=c(-6,6)) affycomp.figure6b(rma.assessment$FC, main="b) RMA", ylim=c(-6,6)) @ \end{center} {Figure 6: a) Observed log fold changes plotted against nominal log fold changes. The dashed lines represent highest, 25th highest, 100th highest, 25th percentile, 75th percentile, smallest 100th, smallest 25th, and smallest log fold change for the genes that were not differentially expressed. b) Like a) but the observed fold changes were calculated for spiked in genes with nominal concentrations no higher than 2pM.} %\end{figure} \subsection{Tables} The function \verb+tableAll+ returns a matrix with assessment statistics. Once the assessment function is run, all one needs to type is <<>>= data(rma.assessment) data(mas5.assessment) tableAll(rma.assessment, mas5.assessment) @ \newpage The function \verb+affycompTable+ makes a minimal table (that is also informative). <<>>= affycompTable(rma.assessment, mas5.assessment) @ \subsection{Standard deviation assessment} The package also contains a simple tool to assess standard error estimates. For this to work the \verb+ExpressionSet+ object used for the assessment must have standard error estimates for the dilution data. We include two examples. <<>>= data(rma.sd.assessment) data(lw.sd.assessment) tableAll(rma.sd.assessment, lw.sd.assessment) @ For the SD assessment, there is also a comparison plot in addition to the basic plot. See \verb+affycompPlot+ (and \verb+affycomp.compfig7+ or \verb+affycomp.figure7+). \newpage %\begin{figure}[htbp!] \begin{center} <>= affycompPlot(lw.sd.assessment, rma.sd.assessment) @ \end{center} {Figure 7: Using the replicates from the dilution data, we calculate the mean predicted variance for each gene, tissue, and dilution by squaring the estimated standard error. The usual sample variances $s^2_{tdg} = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well. These boxplots are of the log ratios of the predicted and observed variance. } %\end{figure} \newpage %\begin{figure}[htbp!] \begin{center} <>= affycompPlot(lw.sd.assessment) @ \end{center} {Figure 7: Using the replicates from the dilution data, we calculate the mean predicted variance for each gene, tissue, and dilution by squaring the estimated standard error. The usual sample variances $s^2_{tdg} = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well. A scatterplot of the log ratios of the predicted and observed variance, against mean log expression. } %\end{figure} \section{The end} Enjoy! \end{document}