%\VignetteDepends{utils} %\VignetteIndexEntry{An R Package for Estimating Gene Expressions using Multiple Scans} %\VignetteKeywords{Preprocessing} %\VignetteKeywords{Gene expressions} %\VignettePackage{multiscan} \documentclass[11pt]{article} \usepackage{epsfig,fullpage} \usepackage{graphicx} \usepackage{hyperref} \parindent 0in \begin{document} \title{\bf {\tt multiscan}: Combining multiple laser scans of microarrays} \author{Mizanur R. Khondoker\footnote {Package maintainer, Email: \texttt {Mizanur.Khondoker@ed.ac.uk}} \footnote{Division of Pathway Medicine, The University of Edinburgh Medical School, The Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, UK.} \and Chris A. Glasbey \footnote{Biomathematics \& Statistics Scotland, King's Buildings, Edinburgh, EH9 3JZ, Scotland, UK. Email:\texttt{chris@bioss.ac.uk.}} \and Bruce J. Worton \footnote {School of Mathematics, University of Edinburgh, King's Buildings, Edinburgh, EH9 3JZ, Scotland, UK. Email: \texttt {Bruce.Worton@ed.ac.uk.} }} \maketitle \tableofcontents \section{Introduction} The sensitivity level of microarray scanners is adjustable and plays a crucial role in getting reliable measurement of the fluorescence intensity. A change in scanner setting transforms the intensity measurements by a multiplicative constant. A scanner's sensitivity has to be raised to a certain level to ensure that the intensity levels of weakly expressed genes exceed the intrinsic noise level of the scanner and so become measurable. This may, however, cause another problem: signal censoring for highly expressed genes. Scanners cannot record pixel intensities above some software dependent threshold ($2^{16} - 1 = 65535$, for a 16-bit computer storage system), so highly expressed genes can have pixel values which are right censored at the largest possible value that the scanner software allows. It is not usually possible to find a scanner setting which is optimal for both weakly and highly expressed genes. So, it seems reasonable to consider multiple scanning of the same microarray at different scanner settings and estimate spot intensities from these combined data.\\ The {\tt multiscan} package implements the method of Khondoker \emph{et al.} (2006) for estimating gene expressions from multiple laser scans of hybridised microarrays. The method is based on a non-linear functional regression model with both additive and multiplicative error terms. Maximum likelihood estimation based on a Cauchy distribution is used to fit the model, which reduces the sampling variability in expression estimates and is able to estimate gene expressions taking account of outliers and the systematic bias caused by signal censoring of highly expressed genes.\\ The package contains a function {\tt multiscan}, and a small data set {\tt murine} for illustrating the use of the function. The function produces output of class {\tt multiscan}. S3 print and plot methods are also defined for this new class.\\ After installation, the package can be loaded using <>= library(multiscan) @ \section{Package documentation} \subsection{Help files} After installing and loading the package, the help file with detailed information of the {\tt multiscan} function, its usage, arguments and values, can be viewed by issuing the command {\tt help(multiscan)} or {\tt ?multiscan} at the R command prompt. \subsection{Package vignettes} To view the package vignette (this document) you need to use <>= vignette("multiscan") @ and a PDF file will open in your PDF reader. For Windows, the package vignette can also be accessed through the R menu bar item {\it Vignettes}. \section{Data} The only required argument for the {\tt multiscan} function is the {\tt data} argument which should be a numeric matrix or data frame containing the intensity data of a single microarray scanned at multiple (two or more) scanner settings. The data frame can be created by reading text or csv files into R, using {\tt read.table} or {\tt read.csv} command. \\ The number of rows ({\tt n}) of the data matrix should be equal to the number of spots/probes on the array, and the number of columns ({\tt m}) equal to the number of scans. Replicated probes on the array are treated as individual spots. Columns can be given in any order, but will be arranged in order of scanner's sensitivity before fitting the model. The data should be on the raw scale, i.e., not transformed!\\ An example data set, {\tt murine}, has been included as part of the package, which can be loaded using <>= data(murine) @ and the first few rows of the data can be viewed by typing <>= murine[1:10,] @ To see a brief description of this example data use <>= help(murine) @ \section{Example usage of {\tt multiscan}} Suppose that we want to obtain estimates of gene expressions combining the 4 scans of the {\tt murine} data set. This can be done using <<>>= data(murine) fit<-multiscan(murine) fit @ To view the components of the fitted model use <<>>= str(fit) @ The estimated gene expressions can be obtained by extracting the {\tt mu} component of the fitted object, i.e., by doing <<>>= gene.exprs<-fit$mu @ A plot of the fitted model can be created by using the generic {\tt plot} function. For example, the plot in Figure~1 was produced using <>= plot(fit) @ \begin{figure}[p] \label{fig:fitted} \begin{center} <>= plot(fit) @ \caption{A plot of the rescaled intensities ($y_{ij}/\hat{\beta}_j$) against the estimated gene expressions. The dashed lines represent the fitted model.} \end{center} \end{figure} To view the description of the other optional arguments of the {\tt multiscan} function please refer to the help file <>= help(multiscan) @ \section{Model diagnostics} The standardised residuals of the fitted model on the input data are stored in the {\tt sdres} component of the fitted {\tt multiscan} object. These residuals can be plotted against the rank of the estimated expression values using <>= op<-par(mfrow=c(2,2)) plot(fit,residual=TRUE) par(op) @ \begin{figure}[p] \label{fig:sdres} \begin{center} <>= op<-par(mfrow=c(2,2)) plot(fit,residual=TRUE) par(op) @ \caption{A plot of standardised residuals against the rank of estimated gene expressions. The dashed lines show 95\% probability limits ($\pm 12.71$).} \end{center} \end{figure} The resulting plot is shown in Figure~2. \section{Computing time and memory usage} The function is computationally slow and memory-intensive. That is due to the nested iteration loops of the numerical optimization of the likelihood function involving a large number ($n+m+2$) of parameters. The optimization uses an alternating algorithm with the Nelder-Mead simplex method (Nelder and Mead, 1965) in the inner loops. The function {\tt multiscan} directly uses the C function {\tt nmmin}, the internal code used in the general-purpose optimization tool {\tt optim}, for implementing the Nelder-Mead simplex method.\\ For large data sets with many tens of thousands of probes, it is recommended to consider first fitting the model using a random subset (e.g. 10,000 rows) of the data matrix, and then using the estimated scanning effects and scale parameters obtained as initial values for fitting the model to the full data set.\\ \section{Alternative software} A web interface, created by David Nutter of Biomathematics \& Statistics Scotland (BioSS), based on the original Fortran code written by Khondoker \emph{et al.} (2006) is available at\\ {\tt http://www.bioss.ac.uk/ktshowcase/create.cgi}. Although it uses the same algorithm, results from the web interface may not be exactly identical to that of {\tt multiscan} as it uses a different (non-free IMSL routine) implementation of the Nelder-Mead simplex method. \clearpage \section{Acknowledgments} We thank Marc Carlson, our colleague Thorsten Forster, and the anonymous reviewers for their useful comments on the earlier version of the package. The work of M.R.K. and C.A.G. in Khondoker \emph{et al.} (2006) was supported by the Scottish Executive Environment and Rural Affairs Department (SEERAD). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{References} \begin{description} \item Khondoker, M. R., Glasbey, C. A. and Worton, B. J. (2006). Statistical estimation of gene expression using multiple laser scans of microarrays. \emph {Bioinformatics} {\bf 22}, 215--219. \item Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. \emph{The Computer Journal} {\bf 7}, 308--313. \end{description} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}