% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{SScore primer} %\VignetteKeywords{Analysis, Affymetrix} %\VignetteDepends{sscore, affy, affyio, affydata} %\VignettePackage{sscore} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in %\headheight=-.3in %\newcommand{\scscst}{\scriptscriptstyle} %\newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \author{Richard E. Kennedy, Kellie J. Archer,\\ Robnet T. Kerns, and Michael F. Miles} \begin{document} \title{Description of S-Score: Expression Analysis of Affymetrix GeneChips from Probe-Level Data} \maketitle \tableofcontents \newpage \section{Introduction} The S-Score algorithm described by \cite{zhang2002} and \cite{kerns2003} is a novel comparative method for gene expression data analysis that performs tests of hypotheses directly from probe level data. It is based on a new error model in which the detected signal is assumed to be proportional to the probe pair signal for highly expressed genes, but assumed to approach a background level (rather than 0) for genes with low levels of expression. This model is used to calculate relative change in probe pair intensities that converts probe signals into multiple measurements with equalized errors, which are summed over a probe set to form the significance score (S-Score). Assuming no expression differences between chips, the S-Score follows a standard normal distribution. Thus, p-values can be easily calculated from the S-Score, and a separate step estimating the probe set expression summary values is not needed. Furthermore, in comparisons of dilution and spike-in microarray datasets, the S-Score demonstrated greater sensitivity than many existing methods, without sacrificing specificity \citep{kennedy2006}. The \Rpackage{sscore} package \citep{kennedy2006b} implements the S-Score algorithm in the R programming environment, making it available to users of the Bioconductor\footnote{\url{http://www.bioconductor.org/}} project. \section{What's new in this version} This release has minor changes for compatibility with the \Robject{ExpressionSet} data class, as well as minor bug fixes. \section{Reading in data and generating S-Scores} Affymetrix data are generated from GeneChips\textregistered\ by analyzing the scanned image of the chip (stored in a *.DAT file) to produce a *.CEL file. The *.CEL file contains, among other information, a decimal number for each probe on the chip that corresponds to its intensity. The S-Score algorithm compares two GeneChips by combining all of the probe intensities from a probeset (typically 11 to 20) into a single summary statistic for each gene. The \Rpackage{sscore} package processes the data obtained from *.CEL files, which must be loaded into R prior to calling the \Rfunction{SScore} function. Thus, the typical sequence of steps to accomplish this is as follows: \begin{enumerate} \item Create a directory containing all *.CEL files relevant to the planned analysis. \item If using Linux / Unix, start R in that directory. \item If using the Rgui for Microsoft Windows, make sure your working directory contains the *.CEL files (use ``File -> Change Dir'' menu item). \item Load the library. <>= library(sscore) options(width=60) library(affydata) @ \item Read in the data and create an expression set. \end{enumerate} Both of the functions \Rfunction{SScore} and \Rfunction{SScoreBatch} operate on an \Robject{AffyBatch} object containing all of the relevant information from the *.CEL files. Additional information regarding the \Rfunction{ReadAffy} function and detailed description of the structure of *.CEL files can be found in the \Rpackage{affy} vignette. Note that, even though the intensities have been loaded into R, \Rfunction{SScore} will still need direct access to the *.CEL files later to obtain the information about outliers. {\bf If a copy of the *.CEL files is not available when } \Rfunction{SScore} {\bf is called, an error may result.} The \Rfunction{SScore} and \Rfunction{SScoreBatch} functions return an object of class \Robject{ExpressionSet}. (The class \Robject{ExpressionSet} is described in the \Rpackage{Biobase} vignette.) The S-Score values are returned in the \Robject{exprs} slot. The following examples illustrate the \Rpackage{sscore} package with the results of the S-Score analysis for the \Robject{Dilution} data set included with the \Rpackage{affydata} package. Due to the nature of this dataset, *.CEL files are not included and computation fo the SF and SDT data (as described below) cannot be performed. %These examples utilize the data files 20A, 20B, 10A, and 10B supplied with the %\Rpackage{sscore} package, %which are *.CEL files corresponding to the data in the \Robject{Dilution} data %set. A basic S-Score analysis is generated using the \Rfunction{SScore} function: <>= data(Dilution) ## get the example data ## get the path to the package directory pathname <- system.file("doc",package="sscore") cel <- Dilution[,c(1,3)] ## only need the first 2 samples per condition SScore.basic <- SScore(cel,celfile.path=pathname, SF=c(4.46,5.72),SDT=c(57.241,63.581),rm.extra=FALSE) @ and the first few S-Score values are <<>>= exprs(SScore.basic)[1:20] @ Optional parameters for \Rfunction{SScore} include: \begin{description} \item[celfile.path] -- character string giving the directory in which the *.CEL files are stored. If a directory is not specified, the current working directory is used. \item[celfile.names] -- character vector giving the filenames of the *.CEL files corresponding to the columns of the \Robject{AffyBatch} object. If filenames are not specified, the sample names of the \Robject{AffyBatch} object are used. \item[SF, SDT] -- the Scale Factor and Standard Difference Threshold. Each is a vector with length equal to the number of columns in the AffyBatch object, and contains a numeric value for each chip. The Scale Factor is used to scale each intensity to a target background value, with the default of 500 (as used by the Affymetrix GeneChip Operating Software [GCOS]). The Standard Difference Threshold is used as an estimate of background noise, and is equal to the standard deviation for the lowest 2\% of intensities on a chip. These values are available from the Affymetrix GCOS output, or may be calculated by the \Rfunction{SScore} function. %An example of an S-Score analysis in which SF and SDT were specified is % %<>= %data(Dilution) %pathname <- system.file("doc",package="sscore") %cel <- Dilution[,c(1,2)] %SScore.sfsdt <- SScore(cel,SF=c(22.24,25.49), %SDT=c(2526.300,2590.297),celfile.path=pathname,rm.extra=FALSE) %@ % %and the first few S-Score values are % %<<>>= %exprs(SScore.sfsdt)[1:20] %@ \item[rm.outliers, rm.mask, rm.extra] -- These are logical values used to exclude certain probes from the S-Score calculations. These options perform the same as they do in the \Rfunction{ReadAffy} function, which it calls. \verb+rm.outliers+ excludes all probes designated as outliers in the *.CEL file. \verb+rm.mask+ excludes all probes designated as masked in the *.CEL file. \verb+rm.extra+ removes both outlier and mask probes, and overrides \verb+rm.outliers+ and \verb+rm.mask+ if these are specified. %An example of an S-Score analysis in which outliers were included is % %<>= %data(Dilution) %pathname <- system.file("doc",package="sscore") %cel <- Dilution[,c(1,3)] %SScore.outliers <- SScore(cel,rm.outliers=FALSE, %rm.mask=FALSE,celfile.path=pathname,SF=c(4.46,5.72), %SDT=c(57.241,63.581),rm.extra=FALSE) %@ % %and the first few S-Score values are % %<<>>= %exprs(SScore.outliers)[1:20] %@ \item[digits] -- a numeric value that specifies the number of significant decimal places for the S-Score and CorrDiff values, which are rounded as needed. The default uses full precision with no rounding. The output from the stand-alone version of the S-Score uses \verb+digits=3+. %The example \Robject{SScore.digits} contains the results of a S-Score analysis %in which only 3 %significant digits were retained: % %<>= %data(Dilution) %pathname <- system.file("doc",package="sscore") %cel <- Dilution[,c(1,3)] %SScore.digits <- SScore(cel,digits=3,celfile.path=pathname, %SF=c(4.46,5.72),SDT=c(57.241,63.581),rm.extra=FALSE) %@ % %and the first few S-Score values are % %<<>>= %exprs(SScore.digits)[1:28] %@ % \item[verbose] -- a logical value indicating whether additional information on the analyses is printed. This includes the chip type, sample names, values of alpha and gamma, and the SF and SDT values. %<<>>= %data(Dilution) %pathname <- system.file("doc",package="sscore") %cel <- Dilution[,c(1,3)] %SScore.sfsdt <- SScore(cel,SF=c(4.46,5.72),SDT=c(57.241,63.581), %verbose=TRUE,celfile.path=pathname,rm.extra=FALSE) %@ % \end{description} \section{Multichip comparisons} Beginning with release 1.7.0, the \Rfunction{SScore} function is capable of comparing two classes where each class includes replicates. As with previous releases, only two class comparisons are available. The multichip comparisons are performed by adding a \verb+classlabel+ vector which distinguishes classes, similar to that of the \Rpackage{multtest} package. The vector \verb+classlabel+ describes to which class each GeneChip belongs. Its length is equal to the number of chips being compared, with each element containing either a 0 or a 1, indicating class assignment. Thus, the assignment <>= labels <- c(0,0,0,1,1,1) @ would compare the first three chips to the last three chips. (Note that the number of chips in the two classes being compared do not have to be equal.) If the \verb+classlabel+ parameter is not specified, it defaults to a two-chip comparison for compatibility with previous versions of \Rfunction{SScore}. An example of a multichip S-Score comparison would be <>= data(Dilution) pathname <- system.file("doc",package="sscore") cel <- Dilution SScore.multi <- SScore(cel,classlabel=c(0,0,1,1), SF=c(4.46,6.32,5.72,9.22),SDT=c(57.241,53.995,63.581, 69.636),celfile.path=pathname,rm.extra=FALSE) @ and the first few S-Score values are <<>>= exprs(SScore.multi)[1:20] @ The other parameters of \Rfunction{SScore} remain unchanged. The output data from the multichip comparison are still standard S-Scores, i.e., they still follow a Normal(0,1) distribution and may be converted to p-values as described below. \section{Multiple pairwise comparisons} Previous versions of the \Rfunction{SScore} function calculated the S-Score values for one pair of chips (i.e. a single two-chip comparison). However, for many experiments, several chips need to be compared. This can be done using the \Rfunction{SScoreBatch} function, which automates the process of making several two-chip comparisons. The setup and options for \Rfunction{SScoreBatch} are very similar to \Rfunction{SScore}. The \Rfunction{SScoreBatch} function has an additional parameter, the \Robject{compare} matrix, which specifies the pairs of chips to compare. It is an N x 2 matrix, where N is the number of comparisons being made. Each row contains the column number of the chips in the \Robject{AffyBatch} object that are being compared. For example, if the \Robject{compare} matrix is set up as \begin{verbatim} [1,] [2,] [,1] 2 5 [,2] 2 6 [,3] 5 9 [,4] 10 2 [,5] 5 7 [,6] 10 8 [,7] 9 4 [,8] 1 2 [,9] 3 10 \end{verbatim} The first comparison made is between the chips in columns 2 and 5 of the \Robject{AffyBatch} object; the second comparison made is between the chips in columns 2 and 6; the third comparison made is between the chips in columns 5 and 9; and so forth. If the \Robject{compare} matrix has more than two columns, only the first two columns will be used for identifying the GeneChips in the \Robject{AffyBatch} object to be compared. Each column of \Robject{eset} will contain the results of a single two-chip comparison. The first column of \Robject{eset} will contain the comparison corresponding to the first row of the \Robject{compare} matrix, the second column of \Robject{eset} will contain the comparison corresponding to the second row of the \Robject{compare} matrix, and so forth. A basic S-Score analysis using \Rfunction{SScoreBatch} is generated using the commands: <>= data(Dilution) pathname <- system.file("doc",package="sscore") compare <- matrix(c(1,2,1,3,1,4),ncol=2,byrow=TRUE) SScoreBatch.basic <- SScoreBatch(Dilution,compare=compare, SF=c(4.46,6.32,5.72,9.22),SDT=c(57.241,53.995,63.58,169.636), celfile.path=pathname,rm.extra=FALSE) @ and the first few S-Score values are <<>>= exprs(SScoreBatch.basic)[1:10,] @ %The example \Robject{SScoreBatch.sfsdt} contains the results of an analysis in %which the SF and SDT %values were specified % %<>= %data(Dilution) %pathname <- system.file("doc",package="sscore") %compare <- matrix(c(1,2,1,3,2,3),ncol=2,byrow=TRUE) %SScoreBatch.sfsdt <- SScoreBatch(Dilution,compare=compare, %SF=c(22.24,25.49,25.56),SDT=c(2526.300, %2590.297,2751.634),celfile.path=pathname) %@ % %and the first few S-Score values are % %<<>>= %exprs(SScoreBatch.sfsdt)[1:10,] %@ % Other parameters for \Rfunction{SScoreBatch} are identical to \Rfunction{SScore}. \section{Using S-Scores in gene expression analysis} Under conditions of no differential expression, the S-Scores follow a standard normal (Gaussian) distribution with a mean of 0 and standard deviation of 1. This makes it straightforward to calculate p-values corresponding to rejection of the null hypothesis and acceptance of the alternative hypothesis of differential gene expression. Cutoff values for the S-Scores can be set to achieve the desired level of significance. As an example, an absolute S-Score value of 3 (signifying 3 standard deviations from the mean, a typical cutoff value) would correspond to a p-value of 0.003. Under this scenario, the significant genes can be found as: <<>>= sscores <- exprs(SScore.basic) ## extract the S-Score values ## find those greater than 3 SD signif <- geneNames(Dilution)[abs(sscores) >= 3] @ Similarly, the p-values can be calculated as: <<>>= sscores <- exprs(SScore.basic) ## extract the S-Score values p.values.1 <- 1 - pnorm(abs(sscores)) ## find the corresponding ## one-sided p-values p.values.2 <- 2*(1 - pnorm(abs(sscores))) ## find the corresponding ## two-sided p-values @ The S-Score algorithm does account for the correlations among probes within a two-chip comparison. However, it does not adjust p-values for multiple comparisons when comparing more than one pair of chips. \section{Computing scale factor and statistical difference threshold} The \Rfunction{SScore} and \Rfunction{SScoreBatch} functions call the function \Rfunction{computeSFandSDT} to compute the values for the Scale Factor (SF) and Statistical Difference Threshold (SDT) if these are not supplied by the user. \Rfunction{computeSFandSDT} is an internal function that generally will not be called or modified. The calculations for the SF and SDT are performed as described in the Affymetrix Statistical Algorithms Description Document \citep{affy:tech:2002} and implemented in the Affymetrix (using SDT = 4 * RawQ * SF). The calculation of these values can be both time- and memory-intensive; it is recommended that the user supply these values from the Affymetrix MAS5 or GCOS Metrics table whenever possible. Alternatively, \Rfunction{computeSFandSDT} may be called directly to obtain the SF and SDT values for each *.CEL file, which are then supplied by the user in subsequent calls to \Rfunction{SScore}. The calculations for each *.CEL file are independent. If memory is not sufficient to allow computation of all SF and SDT values simultaneously, the *.CEL files may be broken into smaller batches; identical results will be obtained either way. In addition to computing the specified values, \Rfunction{computeSFandSDT} may be used to generate histograms of the log intensities for the chips being compared. Such plots are useful for identifying potentially problematic chips prior to analysis. It may also be used to display additional information about the *.CEL file parameters. The options for \Rfunction{computeSFandSDT} are \begin{description} \item[TGT] -- a numeric value for the target intensity to which the arrays should be scaled. \item[verbose] -- a logical value indicating whether additional information on the calculations is printed. This includes the SF, SDT, and RawQ values, as well as descriptive statistics on the background and noise. This is similar to the information provided by the Affymetrix GCOS Metrics table for the *.CEL file. \item[plot.histogram] -- a logical value indicating whether a histogram should be plotted. Both the PM and MM log intensities will be shown in a single graphics window. Separate plots will be generated for each chip being analyzed. \item[digits] -- a numeric value that specifies the number of significant decimal places for the SF and SDT values, which are rounded as needed. Using \verb+digits=3+ rounds to the same number of digits as the stand-alone version of the S-Score. \item[celfile.path] -- character string specifying the directory for *.CEL files \end{description} \Rfunction{computeSFandSDT} requires that the *.CEL files be in text format. The alternate function \Rfunction{computeAffxSFandSDT} expects information obtained from the \Rpackage{affxparser} routines, so that either text or binary files may be used. In addition to the options for \Rfunction{computeSFandSDT}, \Rfunction{computeAffxSFandSDT} has the following required parameters: \begin{description} \item[stdvs] -- a vector of standard deviations of the probe intensities (which can be read using the \verb+readStdvs=TRUE+ option in the \Rpackage{affxparser} function \Rfunction{readCel}). \item[pixels] -- a vector of the number of pixels used in calculating the probe intensity (which can be read using the \verb+readStdvs=TRUE+ option in the \Rpackage{affxparser} function \Rfunction{readCel}). \end{description} \section{Identifying outliers} The current version of the \Rfunction{SScore} and \Rfunction{SScoreBatch} functions use the information contained in the *.CEL files to flag probes as outliers that should be excluded from the S-Score calculation. In previous versions, this was accomplished using the \Rfunction{computeOutlier} function, which is retained for compatibility. This is an internal function that generally will not be called or modified. The \Rfunction{computeOutlier} function was called if the \verb+rm.outliers+, \verb+rm.mask+, or \verb+rm.extra+ parameters of \Rfunction{SScore} or \Rfunction{SScoreBatch} are set to \verb+TRUE+. These parameters work as described in the \Rpackage{affy} documentation since they are passed to the \Rfunction{ReadAffy} function to identify outlier and mask probes. The return value from \Rfunction{computeOutlier} is a logical matrix the same size and order as the intensity matrix for the \Robject{AffyBatch} object. Each cell of the logical matrix contains a \verb+TRUE+ value if the corresponding intensity is identified as an outlier and excluded from the S-Score calculation; otherwise it contains \verb+FALSE+. \section{Changes from the stand-alone version} The S-Score algorithm has been previously implemented as a stand-alone executable for the Windows operating system, using Borland Delphi. This version has been available from the Miles Laboratory at \url{http://www.brainchip.vcu.edu/expressionda.htm}. Users of the stand-alone version will notice small differences in results compared to the \Rpackage{sscore} package as it is implemented in R, though these should not significantly affect inferences regarding gene expression. The following lists identifies differences between the two implementations: \begin{enumerate} \item The stand-alone version excludes outlier, masked, and modified intensities from calculations when using *.CEL files. When using *.CSV files, the stand-alone program also excludes outlier, masked, and modified intensities {\it if the corresponding *.CEL file is present for obtaining this information}. (The *.CSV file does not contain any information about which intensities are outlier, masked, or modified.) The default for the R package is {\it not} to exclude outlier, masked, or modified intensities, though this may be changed using various options. Note that, due to the way the \Rpackage{affy} package is implemented, it is not possible to exclude modified intensities using the \Rpackage{sscore} package. \item The rounding methods are not identical for Borland Delphi and R, which can lead to slight differences in calculations. The difference is negligible for most of the S-Score calculations, and should be less than or equal to 0.001. \item The SF and SDT calculations in the stand-alone version are performed using an independently developed algorithm. The original C++ version uses natural logarithms, while the Delphi version uses base 10 logarithm. The \Rpackage{sscore} package uses a ported version of the Affymetrix algorithms described on the Affymetrix website \url{http://www.affymetrix.com} under Support -> Developer's Network -> Open Source -> MAS5 Stat SDK. Base 2 logarithms are used for these calculations. \end{enumerate} A Java version of the S-Score algorithm is also under development. Differences between the Java version and the \Rpackage{sscore} package will be included after the Java version is released. \section{Version history} \begin{description} \item[1.7.0] added routines to compute S-Scores for replicate chips within a 2-class comparison. Also updated functions to operate on the new \Rfunction{ExpressionSet} class, and changed routines for reading of binary *.CEL files from \Rpackage{affxparser} to \Rpackage{affyio} due to stability problems with the former on the Macintosh PowerPC platform. \item[1.5.4] incorporated routines from the \Rpackage{affxparser} package for reading of binary *.CEL files. Added option to specify *.CEL file names in the \Rfunction{SScore} and \Rfunction{SScoreBatch} functions. \item[1.4.2] corrected a bug resulting in too many open file handles for large \Robject{AffyBatch} objects. \item[1.4.1] corrected a bug in assigning column names to \Robject{exprSet} object \item[1.4.0] first public release \item[1.0.0] initial development version \end{description} \section{Acknowledgements} The development of the S-Score algorithm and its original implementation in C++ is the work of Dr. Li Zhang. The Delphi implementation of the S-Score algorithm is the work of Dr. Robnet Kerns. This work was partly supported by NLM F37 training grant LM008728 to Richard E. Kennedy and NIAAA research grant AA13678 to Michael F. Miles. \bibliographystyle{plainnat} \bibliography{sscore} \end{document}