% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{gcrma1.2}
%\VignetteKeywords{Preprocessing, Affymetrix}
%\VignetteDepends{affy,Biostrings,tools,splines}
%\VignettePackage{gcrma}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\author{Zhijin (Jean) Wu, Rafael Irizarry}
\begin{document}
\title{Description of gcrma package}
\maketitle
\tableofcontents

\section{Introduction}
The \Rpackage{gcrma} package is part of the
Bioconductor\footnote{\url{http://www.bioconductor.org/}} project.

\Rpackage{gcrma} adjusts for background intensities in Affymetrix array
data, which include optical noise and non-specific binding (NSB). The main
function \Rfunction{gcrma} converts background-adjusted probe intensities
to expression measures using the same normalization and summarization
methods as \Rfunction{rma} (Robust Multiarray Average).

\Rpackage{gcrma} uses probe sequence information to estimate probe
affinity to non-specific binding (NSB). The sequence information is
summarized in a more detailed way than simple GC content: the base types
(A, T, G or C) at each position (1--25) along the probe determine the
{\it affinity} of each probe. The parameters of the position-specific base
contributions to the probe affinity are estimated in an NSB experiment in
which only NSB, and no gene-specific binding, is expected. In version
2.0.0 users can choose the data source from which these parameters are
estimated.

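
As an illustration of this position-specific summary, the affinity of a
probe can be written as a sum of one contribution per position. The
notation below ($\alpha$, $b_k$, $\mu_{j,k}$) is introduced here for
exposition only and is not taken from the package code; it is a sketch of
the idea, not necessarily the exact parameterization used by
\Rpackage{gcrma}:
$$\alpha=\sum_{k=1}^{25}\ \sum_{j\in\{A,T,G,C\}}\mu_{j,k}\,1_{\{b_k=j\}},$$
where $b_k$ is the base at position $k$ of the probe, $1_{\{b_k=j\}}$
equals 1 when that base is $j$ and 0 otherwise, and $\mu_{j,k}$ is the
contribution of base $j$ at position $k$. For a given probe exactly one
term is non-zero at each position, so $\alpha$ is a sum of 25
contributions. If $\mu_{j,k}$ did not depend on $k$ and took one value for
G and C and another for A and T, $\alpha$ would reduce to a function of GC
content alone.
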
With the probe affinities available, we estimate the relationship between
the amount of NSB and the probe sequences. Specifically, we estimate the
function
$$\mbox{NSB}=h(\mbox{affinity})$$
by fitting a loess curve through
$$\mbox{MM probe intensities} \sim \mbox{MM probe affinities}.$$
In version 2.0.0 we also allow the use of any list of negative control
(NC) probes instead of MM probes.

The background-adjusted intensity is computed as the posterior mean of
specific binding given the observed intensities and the probe sequences.
This is done in function \Rfunction{bg.adjust.gcrma}. The
background-adjusted data are then converted to expression measures with
function \Rfunction{rma}, using the option \Rfunarg{background=FALSE} to
avoid another round of background correction.

The following terms are used throughout this document:
\begin{description}
\item[probe] oligonucleotides of 25 base pair length used to probe RNA
  targets.
\item[perfect match] probes intended to match the target sequence
  perfectly.
\item[$PM$] intensity value read from the perfect matches.
\item[mismatch] probes having one base mismatch with the target sequence,
  intended to account for non-specific binding.
\item[$MM$] intensity value read from the mismatches.
\item[probe pair] a unit composed of a perfect match and its mismatch.
\item[affyID] an identification for a probe set (which can be a gene or a
  fraction of a gene) represented on the array.
\item[probe pair set] $PM$s and $MM$s related to a common {\it affyID}.
\item[{\it CEL} files] contain measured intensities and locations for an
  array that has been hybridized.
\item[{\it CDF} file] contains the information relating probe pair sets to
  locations on the array.
\item[{\it probe} file] contains the information relating probe sequences
  to locations on the array.
\end{description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{What's new in version 2.0.0}
We added a function \Rfunction{bg.adjust.gcrma} for the convenience of
performing gcrma background adjustment only, without summarizing into
gene-level data.

In earlier versions, the sequence-determined probe affinities were
computed using parameters estimated from data obtained in our non-specific
binding (NSB) experiment. In version 2.0.0 users can choose their own
source of such data. Users can choose to
\begin{enumerate}
\item compute probe affinities based on their own non-specific binding
  experiment (typically all probes in such an experiment reflect
  non-specific binding, but users can choose their own list of probes)
\item compute probe affinities using each experimental array, with
  user-defined negative control (NC) probes (MMs will be used if NC probes
  are not specified).
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Options for \Rfunction{gcrma}}
Here we explain the major options in \Rfunction{gcrma}. More detailed
examples are given in the next section.
\begin{enumerate}
\item \Robject{affinity.info}

This is in general an AffyBatch object that stores probe affinities
instead of probe intensities. When \Robject{affinity.info} is left as
{\it NULL} (the default), it is computed within the function
\Rfunction{gcrma}.

To compute the affinity of each probe, we first obtain base-position
profiles (the contribution of each base type at each position along the
probe) from non-specific binding data. The user can choose to use the
developers' reference data, or to use each experimental array together
with the indices of negative control (NC) probes. If the NC probe index is
not provided, MM probes will be used as NC probes.

Some users have expressed the concern that their array type may behave
differently from the human hgu95 array, which was used in the developers'
non-specific binding experiment. Such users can run an independent
non-specific binding experiment for their own research. The
\Robject{affinity.info} can then be obtained separately by
\begin{verbatim}
my.affinity.info <- compute.affinities.local(myNsbData)
\end{verbatim}
and passed to function \Rfunction{gcrma}:
\begin{verbatim}
est <- gcrma(myExprData, affinity.info=my.affinity.info)
\end{verbatim}

\item \Rfunarg{type}

The options for \Rfunarg{type} are
\begin{itemize}
\item {\it fullmodel}: uses both probe sequence information and observed
  MM probe intensities
\item {\it affinities}: uses probe sequence information and ignores MMs
\item {\it MM}: uses MM probe intensities and ignores sequence information
\end{itemize}

\item \Rfunarg{fast}

When \Rfunarg{fast} is set to {\it TRUE}, an ad hoc procedure is used to
speed up the non-specific binding correction. In previous versions the
default was \Rfunarg{fast=TRUE}; in version 2.0.0 it is changed to
\Rfunarg{fast=FALSE}.
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Getting started: the simplest example}
\label{sec:get.started}
You will need the following libraries: \Rpackage{affy}, \Rpackage{MASS},
and the cdf and probe packages for your chip type (such as
\Rpackage{hgu95av2cdf} and \Rpackage{hgu95av2probe}).

The first thing you need to do is {\bf load the package}.
\begin{Sinput}
R> library(gcrma) ##load the gcrma package
\end{Sinput}
%%<>=
%%library(gcrma)
%%@

If all you want is to go from probe-level data ({\it CEL} files) to
expression measures, the quickest way of reading in the data and getting
expression measures is the following:
\begin{enumerate}
\item Create a directory and move all the relevant {\it CEL} files to that
  directory.
\item Start R in that directory.
\item If using the Rgui for Microsoft Windows, make sure your working
  directory contains the {\it CEL} files (use the ``File $\rightarrow$
  Change Dir'' menu item).
\item Load the library.
\begin{Sinput}
R> library(gcrma) ##load the gcrma package
\end{Sinput}
\item Read in the data and compute expression measures with
  \Rfunction{gcrma}.
\begin{verbatim}
R> Data <- ReadAffy() ##read data in working directory
R> eset <- gcrma(Data)
\end{verbatim}
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{More Examples with options}
For illustration we will use the Dilution data.
\begin{Sinput}
R> require(affydata)
R> data(Dilution)
\end{Sinput}
%%<>=
%%require(affydata)
%%data(Dilution)
%%@

\begin{enumerate}
\item Obtain expression values

For example,
\begin{Sinput}
R> Dil.expr1<-gcrma(Dilution)
\end{Sinput}
%%<>=
%%Dil.expr1<-gcrma(Dilution)
%%@

To use the faster ad hoc procedure one can call
\begin{Sinput}
R> Dil.expr2<-gcrma(Dilution,fast=TRUE)
\end{Sinput}
%%<>=
%%Dil.expr2<-gcrma(Dilution,fast=TRUE)
%%@

Suppose the user has run his/her own NSB experiment and wants to compute
\Robject{affinity.info} from that experiment.
\begin{Sinput}
R> myNsbData <- ReadAffy("mynsb.cel")
R> my.affinity.info <- compute.affinities.local(myNsbData)
R> Dil.expr3 <- gcrma(myExprData,affinity.info=my.affinity.info)
\end{Sinput}
%% <>=
%% myNsbData<-ReadAffy("mynsb.CEL")
%% my.affinity.info <- compute.affinities.local(myNsbData)
%% Dil.expr3<-gcrma(Dilution,affinity.info=my.affinity.info)
%% @

Suppose the user would like to use NC probes in the experimental data to
estimate probe affinities. Here we use MM probes as an example of NC
probes.
\begin{Sinput}
R> mmIndex <- unlist(indexProbes(Dilution,"mm"))
R> Dil.expr4 <- gcrma(Dilution,affinity.source="local",NCprobe=mmIndex)
\end{Sinput}
%%<>=
%%mmIndex <- unlist(indexProbes(Dilution,"mm"))
%%Dil.expr4 <- gcrma(Dilution,affinity.source="local",NCprobe=mmIndex)
%%@

Since MM probes are the default when \Rfunarg{NCprobe} is not provided,
the above gives the same result as
\begin{Sinput}
R> Dil.expr5 <- gcrma(Dilution,affinity.source="local")
\end{Sinput}
%%<>=
%%Dil.expr5 <- gcrma(Dilution,affinity.source="local")
%%@

\item Background adjustment only

The function \Rfunction{bg.adjust.gcrma} allows one to perform background
adjustment only.
\begin{Sinput}
R> Dil.bgadj <- bg.adjust.gcrma(Dilution)
R> Dil.expr6 <- rma(Dil.bgadj,background=FALSE)
\end{Sinput}
%%<>=
%%Dil.bgadj <- bg.adjust.gcrma(Dilution)
%%Dil.expr6 <- rma(Dil.bgadj,background=FALSE)
%%@

\Rfunction{gcrma} also tries to adjust for specific binding using the
probe sequence. The user can turn off this feature by specifying
\Rfunarg{GSB.adjust=FALSE}.
\begin{Sinput}
R> Dil.bgadj <- bg.adjust.gcrma(Dilution,GSB.adjust=FALSE)
\end{Sinput}
%%<>=
%%Dil.bgadj <- bg.adjust.gcrma(Dilution,GSB.adjust=FALSE)
%%@
\end{enumerate}

\section{Efficient use of gcrma}
Most users work repeatedly with one or a few types of GeneChip arrays. To
use gcrma efficiently, one can compute the \Robject{affinity.info} once
and save it, which saves the time of recomputing \Robject{affinity.info}
every time \Rfunction{gcrma} (or \Rfunction{bg.adjust.gcrma}) is called.

For example, the Dilution data are from hgu95av2 chips. We can compute the
\Robject{affinity.info} of chip type ``hgu95av2'' using the NSB data
provided in \Rpackage{gcrma} and save it in a file.
\begin{Sinput}
R> affinity.info.hgu95av2 <- compute.affinities("hgu95av2")
R> save(affinity.info.hgu95av2,file = "affinity.hgu95av2.RData")
\end{Sinput}
or
\begin{Sinput}
R> affinity.info.hgu95av2 <- compute.affinities(cdfName(Dilution))
R> save(affinity.info.hgu95av2,file = "affinity.hgu95av2.RData")
\end{Sinput}

Now when you need to call \Rfunction{gcrma} for the same type of array,
there is no need to compute \Robject{affinity.info} again:
\begin{Sinput}
R> library(gcrma)
R> library(affydata) ##provides the Dilution data
R> data(Dilution)
R> load("affinity.hgu95av2.RData")
R> Dil.expr7 <- gcrma(Dilution,affinity.info=affinity.info.hgu95av2)
\end{Sinput}

\newpage
%\bibliographystyle{plainnat}
%\bibliography{affy}

\end{document}