% automatic manuscript creation for frma
% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{frma: Preprocessing for single arrays and array batches}
%\VignetteDepends{frma, hgu133afrmavecs, frmaExampleData}
%\VignettePackage{frma}
\documentclass[12pt]{article}
\usepackage{hyperref}
\usepackage[authoryear, round]{natbib}

\textwidth=6.2in
\textheight=8.5in
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand\Rpackage[1]{{\textsf{#1}\index{#1 (package)}}}
\newcommand\dataset[1]{{\textit{#1}\index{#1 (data set)}}}
\newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}}
\newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}}
\newcommand\Rfunarg[1]{{\small\texttt{#1}}}
\newcommand\Robject[1]{{\small\texttt{#1}}}

\author{Matthew N. McCall}

\begin{document}
\title{Preprocessing and Barcoding of Single Microarrays and Microarray Batches (frma)}
\maketitle
\tableofcontents

\section{Frozen Robust Multiarray Analysis (fRMA)}
Frozen RMA (fRMA) is a microarray preprocessing algorithm that allows
one to analyze microarrays individually or in small batches and then
combine the data for analysis. This is accomplished by utilizing
information from the large publicly available microarray databases. In
particular, estimates of probe-specific effects and variances are
precomputed and frozen. Then, with new data sets, these are used in
concert with information from the new array(s) to normalize and
summarize the data. 

This section describes \Rfunction{frma}, which implements the methods
proposed in the paper, \emph{Matthew N. McCall, Benjamin M. Bolstad,
and Rafael A. Irizarry, ``FROZEN ROBUST MULTI-ARRAY ANALYSIS (fRMA)''
(May 2009). Johns Hopkins University, Dept. of Biostatistics Working
Papers. Working Paper 189.} Ideally someone interested in using this
package would first read that paper and then proceed to the sections
below.

\subsection{Getting Started}

If all you want is to go from probe level data ({\it CEL} files) to
expression measures here are some quick ways.

\subsubsection{Single Array: Median Polish}
If you want a quick single array version of RMA using median polish,
the quickest way of reading in data and getting expression measures is
the following:
\begin{enumerate}
\item Download and install frma and the appropriate data file (i.e. hgu133afrmavecs, hgu133plus2frmavecs, etc.).
\item Create a directory and move all the relevant
{\it CEL} files to that directory.
\item If using linux/unix, start R in that directory.
\item If using the Rgui for 
Microsoft Windows make sure your working directory contains the {\it
CEL} files (use ``File -> Change Dir'' menu item).
\item Load the library.
<<echo=T,results=hide>>=
library(frma)
@ 
\item Read in the data and preprocess using the ``median\_polish'' option.
<<echo=T,results=hide>>=
library(frmaExampleData)
data(AffyBatchExample)
object <- frma(AffyBatchExample, summarize="median_polish")
@ 
\end{enumerate}

\subsubsection{Single Array: Robust Weighted Average}
If you want the robust single array method described in the fRMA
paper, the quickest way of reading in data and getting expression
measures is the following:
\begin{enumerate}
\item Download and install frma and the appropriate data file (i.e. hgu133afrmavecs, hgu133plus2frmavecs, etc.).
\item Create a directory and move all the relevant
{\it CEL} files to that directory.
\item If using linux/unix, start R in that directory.
\item If using the Rgui for 
Microsoft Windows make sure your working directory contains the {\it
CEL} files (use ``File -> Change Dir'' menu item).
\item Load the library.
<<echo=T,results=hide>>=
library(frma)
@ 
\item Read in the data and preprocess using the ``robust\_weighted\_average'' option.
<<echo=T,results=hide>>=
library(frmaExampleData)
data(AffyBatchExample)
object <- frma(AffyBatchExample, summarize="robust_weighted_average")
@ 
\end{enumerate}

\subsubsection{Multiple Arrays: Random Effect Model}
If you want to preprocess a batch of arrays using the model described
in the fRMA paper, the quickest way of reading in data and getting
expression measures is the following:
\begin{enumerate}
\item Download and install frma and the appropriate data file (i.e. hgu133afrmavecs, hgu133plus2frmavecs, etc.).
\item Create a directory and move all the relevant
{\it CEL} files to that directory.
\item If using linux/unix, start R in that directory.
\item If using the Rgui for 
Microsoft Windows make sure your working directory contains the {\it
CEL} files (use ``File -> Change Dir'' menu item).
\item Load the library.
<<echo=T,results=hide>>=
##library(frma)
@ 
\item Read in the data and preprocess using the ``random\_effect'' option.
<<echo=T,results=hide>>=
##library(frmaExampleData)
##data(AffyBatchExample)
##object <- frma(AffyBatchExample, summarize="random_effect")
@ 
\end{enumerate}

In each of the cases above, the final \Robject{object} will be an ExpressionSet. To obtain a matrix of gene-level expression values, enter the
following command:
<<echo=T,results=hide>>=
e <- exprs(object)
@ 

Depending on the size of your dataset and on the memory available to
your system, you might experience errors like `Cannot allocate vector
\ldots'.  An obvious option is to increase the memory available to
your R process (by adding memory and/or closing external
applications). You might also consider analyzing the data in smaller
batches.

\subsection{Advanced Options}
\subsubsection{Summarization Methods}
\emph{Summarization} refers to the method used to combine probe-level
expression values to obtain gene-level expression estimates. There are
several summarization methods that one can choose from when running
frma. A brief description of each of the methods follow:
\begin{itemize}
  \item {\bf{average}:} compute the mean of the probes in each probeset
  \item {\bf{median}:} compute the median of the probes in each probeset
  \item {\bf{median\_polish}:} subtract the probe-effect and then compute the median of the probes in each probeset
  \item {\bf{weighted average}:} compute a weighted average of the probes in each probeset with weights equal to the inverse of the sum of the precomputed within and between batch variance estimates.
  \item {\bf{robust\_weighted\_ average}:} compute a weighted average of the probes in each probeset with weights equal to the weights returned by an M-estimation procedure divided by the sum of the precomputed within and between batch variance estimates. 
  \item {\bf{random\_ effect}:} the robust weighted average method adapted for a batch of new arrays.
\end{itemize}

For a more in depth description of the final two methods, see
\emph{Matthew N. McCall, Benjamin M. Bolstad, and Rafael A. Irizarry,
``FROZEN ROBUST MULTI-ARRAY ANALYSIS (fRMA)'' (May 2009). Johns
Hopkins University, Dept. of Biostatistics Working Papers. Working
Paper 189.}

\subsubsection{Input Vectors}
While the vast majority of users will want to stick to the precomputed vectors provided in the frmavecs packages, the frma function will accept user-supplied vectors. These should be given in a list with elements normVec, probeVec, probeVarBetween, probeVarWithin, and probesetSD. A description of each of these elements follows:
\begin{itemize}
  \item {\bf{normVec}:} a vector containing values of the reference distribution to which samples will be quantile normalized.
  \item {\bf{probeVec}:} a vector of probe-effect estimates.
  \item {\bf{probeVarBetween}:} a vector of the between batch variance for each probe.
  \item {\bf{probeVarWithin}:} a vector of the within batch variance for each probe.
  \item {\bf{probesetSD}:} a vector of average within probeset standard deviations.
\end{itemize}

For further information on how these vectors are computed in the frmavecs packages, see the frmaTools package.

\subsubsection{Output Parameters}
While the default is to only return the gene-level expression
estimates and if applicable their standard errors, the frma function
can also return additional information about the estimates depending
on the summarization method chosen. A description of the arguments
that can be included in the output.param argument follows:
\begin{itemize}
  \item {\bf{weights}:} the weights from the M-estimation procedure.
  \item {\bf{residuals}:} the residuals from fitting the probe-level model.
  \item {\bf{randomeffects}:} estimated random effects from fitting the probe-level model with random effect summarization.
\end{itemize}

Not all of these outputs are available for all of the summarization
methods.

\section{Creation of Gene Expression Barcodes}
The barcode algorithm is designed to estimate which genes are
expressed and which are unexpressed in a given microarray
hybridization. This is accopmlished by: (1) using the distribution of
observed $log_2$ intensities across a wide variety of tissues to
estimate an expressed and an unexpressed distribution for each gene,
and (2) for each gene in a sample, denoting it as expressed if its
observed $log_2$ intensity is more likely under the expressed
distribution than under the unexpressed distribution and as
unexpressed otherwise. The first step is accomplished by fitting a
hierarchical mixture model to the plethora of publicly available
data. The second step is accomplished by determining where the
observed intensities from the new array fall in these
distributions. The output of our algorithm is a vector of ones and
zeros denoting which genes are estimated to be expressed (ones) and
unexpressed (zeros). We call this a \emph{gene expression barcode}.

This section describes \Rfunction{barcode}, which implements the
methods proposed in the paper, \emph{Matthew N. McCall, Michael
J. Zilliox, and Rafael A. Irizarry, ``Gene Expression Barcodes Based
on Data from 8,277 Microarrays'' (October 2009). Johns Hopkins
University, Dept. of Biostatistics Working Papers. Working Paper 200.}
Ideally someone interested in using this package would first read that
paper and then proceed to the sections below.

\subsection{Getting Started}
To create a gene expression barcode, one needs estimates of the gene
expression distributions -- specifically the mean and variance of the
unexpressed distribution for each gene. Fortunately, we have
precomputed these for 3 popular Affymetrix microarray platforms --
hgu133a, hgu133plus2, and mouse4302. To use one of these 3, simply
preprocess your data using the default options of \Rfunction{frma} and
then run \Rfunction{barcode} on the resulting object.

\subsubsection{Example}
\begin{enumerate}
\item Download and install the frma package and the appropriate data
package(s) (i.e. hgu133afrmavecs).
\item Create a directory and move all the relevant
{\it CEL} files to that directory.
\item If using linux/unix, start R in that directory.
\item If using the Rgui for Microsoft Windows make sure your working
directory contains the {\it CEL} files (use ``File -> Change Dir''
menu item).
\item Load the libraries.
<<echo=T,results=hide>>=
library(frma)
@ 
\item Read in the data and preprocess using the default options.
<<echo=T,results=hide>>=
library(frmaExampleData)
data(AffyBatchExample)
object <- frma(AffyBatchExample)
@ 
\item Create a gene expression barcode.
<<echo=T,results=hide>>=
bc <- barcode(object)
@ 
\end{enumerate}

\subsection{Output Options}

The default output of the \Rfunction{barcode} function is to return a vector of ones (expressed) and zeros (unexpressed); however, there are alternative output options. A brief description of each of these follows:
\begin{itemize}
  \item {\bf{weight}:} a vector of weights which roughly correspond to the probability of expression for each gene.
  \item {\bf{z-score}:} a vector of z-scores under the unexpressed normal distribution for each gene.
  \item {\bf{p-value}:} a vector of p-values under the unexpressed normal distributionfor each gene.
\end{itemize}


\end{document}