%\VignetteIndexEntry{marrayInput Introduction}
%\VignetteKeywords{Preprocessing, Expression Analysis}
%\VignetteDepends{marray}
%\VignettePackage{marray}
\documentclass[11pt]{article}

\usepackage{amsmath,fullpage, graphicx}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\parindent 0in

\bibliographystyle{abbrvnat}

\begin{document}

\title{\bf Introduction to the Bioconductor marray package : Input component}


\author{Yee Hwa Yang$^1$ and Sandrine Dudoit$^2$}

\maketitle
\begin{center}
1. Department of Medicine, University of California, San Francisco,\\
   {\tt jean@biostat.berkeley.edu}\\
2. Division of Biostatistics, University of California, Berkeley.
\end{center}


% library(tools) 
% setwd("C:/MyDoc/Projects/madman/Rpacks/marray/inst/doc")
% Rnwfile<-file.path("C:/MyDoc/Projects/madman/Rpacks/marray/inst/doc","marrayInput.Rnw")
% options(width=65)
% Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex())


\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}

This document provides a tutorial for the {\bf data input} component of
the {\tt marray} package.  This is similar to the previous {\tt
marrayInput} package which has now been combined with the suite of other
four packages for diagnostic plots and normalization of cDNA microarray
data.  This package relies on object--oriented class/method mechanism,
provided by the R {\tt methods} package, to allow efficient and
systematic representation and manipulation of microarray data. \\

This vignette describeds functionality for reading microarray data into
R, such as intensity data from image processing output files (e.g. {\tt
.spot} and {\tt .gpr} files for the {\tt Spot} and {\tt GenePix}
packages, respectively) and textual information on probes and targets
(e.g. from gal files and god lists). A {\tt tcltk} widget is supplied to
facilitate and automate data input and the creation of microarray
specific R objects for storing these data. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Getting started}

To load the {\tt marray} package in your R session, type {\tt
library(marray)}.  We demonstrate the functionality of this R packages
using gene expression data from the Swirl zebrafish experiment which is
included as part of the package. To load the swirl dataset, use {\tt
data(swirl)}, and to view a description of the experiments and data,
type {\tt ?  swirl}.  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Case study: Swirl zebrafish microarray experiment}\label{sData} 

We demonstrate the functionality of this collection of R packages using
gene expression data from the Swirl zebrafish experiment. These data
were provided by Katrin Wuennenberg--Stapleton from the Ngai Lab at UC
Berkeley. (The swirl embryos for this experiment were provided by David
Kimelman and David Raible at the University of Washington.)  This
experiment was carried out using zebrafish as a model organism to study
early development in vertebrates. Swirl is a point mutant in the BMP2
gene that affects the dorsal/ventral body axis. Ventral fates such as
blood are reduced, whereas dorsal structures such as somites and
notochord are expanded. A goal of the Swirl experiment is to identify
genes with altered expression in the swirl mutant compared to wild--type
zebrafish. Two sets of dye--swap experiments were performed, for a total
of four replicate hybridizations. For each of these hybridizations,
target cDNA from the swirl mutant was labeled using one of the Cy3 or
Cy5 dyes and the target cDNA wild--type mutant was labeled using the
other dye. Target cDNA was hybridized to microarrays containing 8,448
cDNA probes, including 768 controls spots (e.g. negative, positive, and
normalization controls spots). Microarrays were printed using $4 \times
4$ print--tips and are thus partitioned into a $4 \times 4$ grid
matrix. Each grid consists of a $22 \times 24$ spot matrix that was
printed with a single print--tip. Here, spot row and plate coordinates
should coincide, as each row of spots corresponds to probe sequences
from the same 384 well--plate.\\

Each of the four hybridizations produced a pair of 16--bit images, which
were processed using the image analysis software package {\tt Spot}
\citep{Spot,Image}.  The dataset includes four output files {\tt
swirl.1.spot}, {\tt swirl.2.spot}, {\tt swirl.3.spot}, and {\tt
swirl.4.spot} from the {\tt Spot} package.  Each of these files contains
8,448 rows and 30 columns; rows correspond to spots and columns to
different statistics from the {\tt Spot} image analysis output. The file
{\tt fish.gal} is a gal file generated by the {\tt GenePix} program; it
contains information on individual probe sequences, such as gene names,
spot ID, spot coordinates.  Hybridization information for the mutant and
wild--type target samples is stored in {\tt SwirlSample.txt}. All
fluorescence intensity data from processed images are also included in
this package (see Section \ref{sInput} for greater details).\\

To load the swirl dataset, use {\tt data(swirl)}, and to view a
description of the experiments and data, type {\tt ? swirl}.  Below,
we give step--by--step instructions for reading the swirl data into
R. For convenience, we have also stored the resuls in the object {\tt
swirl} of class {\tt marrayRaw}.

<<eval=TRUE>>=
library(marray)
data(swirl)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Package {\tt marrayInput} --  Reading microarray data into
  R}\label{sInput} 

We begin our analysis of microarray data with the fluorescence
intensities produced by image processing of the microarray scanned
images. These data are typically stored in tables whose rows correspond
to the spotted probe sequences and columns to different spot statistics:
e.g. grid row and column coordinates, spot row and column coordinates,
red and green background and foreground intensities for different
segmentation and background adjustment methods, spot morphology
statistics, etc. For the {\tt GenePix} image processing software, these
are the {\tt .gpr} files, and for {\tt Spot}, these are the {\tt .spot}
files. We also consider probe and target textual information stored, for
example, in {\tt .gal} and {\tt .gdl} (god list) files. The main
functions in the {\tt marrayInput} package are {\tt read.marrayLayout},
{\tt read.marrayInfo}, and {\tt read.marrayRaw}, which create objects of
classes {\tt marrayLayout}, {\tt marrayInfo}, and {\tt marrayRaw},
respectively. Widgets are provided for each of these functions to
facilitate data entry.\\

For the Swirl zebrafish experiment, textual information and fluorescence
intensity data from processed images were included as part of the
package and can be accessed as follows, where {\tt datadir} is the name
of the R package sub--directory containing the data files.

<<eval=TRUE>>=
datadir <- system.file("swirldata", package="marray")
dir(datadir)
@

In general, microarray data consist of three distinct components; probes
(genes) information, target(samples) information and measured gene
expression levels information. Analyzing expression intensities alone
with no corresponding probes or target information is
meaningless. Therefore a data structure \Robject{marrayRaw} is created
to store and link these information together in one R object.

\subsection{ Reading target information} 
We refer to {\it target file} as a file that lists the microarrays
hybridization and describes which RNA samples were hybridized to each
array. A target file is typically a tab-delimited text file which
include at least the complete and {\bf exact} name of each image
processing file you would like to include in the data analysis and the
corresponding names for the Cy3 and Cy5 labeled sample information. It
is also informative to include other variables of interest that are
useful for downstream analysis or for quality assessment. Examples
include subject identification number, gender, age, date of
hybridization, scanning conditions amongst others.\\

The main functions in the {\tt marray} package for this purpose is {\tt
read.marrayInfo}, which will create an R object of class {\tt
marrayInfo}.  Objects of class {\tt marrayInfo} may be used to store
information on probe sequences and target samples. For example, reading
in the target information for the swirl experiment can done with

<<eval=TRUE>>=
swirlTargets <- read.marrayInfo(file.path(datadir, "SwirlSample.txt"))
summary(swirlTargets)
@

\subsection{Reading probes related information}
Probes related information refers to descriptions of the spotted probe
sequences (e.g. matrix of gene names, annotation, notes on printing
conditions). Printing conditions or array fabrication information
include the dimensions of the spot and grid matrices, and, for each
probe on the array, its grid matrix and spot matrix coordinates. In
addition, we also include plate origin of the probes, and information on
the spotted control sequences (e.g. negative controls, housekeeping
genes, spiked in-control probes and many others). These information are
store separately using two objects: an object of class
\Robject{marrayInfo} on the probes annotation information and an object
of class {\tt marrayLayout} to store arrays fabrication information. 

There are two ways to read probes related information: the {\bf first}
method is to use the function {\tt read.Galfile} as follow:
<<eval=TRUE, echo=TRUE>>=
galinfo <- read.Galfile("fish.gal", path=datadir) 
names(galinfo)
@ 
Users can modify the arguments \Rfunction{info.id} and
\Rfunction{layout.id} to specify which column names or index represent
probe annotation and printer layout (array fabrication) information
respectively.  For example the following code reads in the galfile {\tt
fish2.gal} where probe information are stored under the columns {\tt
Gene ID} and {\tt Gene description} and the printeralayout information
is stored under the columns {\tt Grid}, {\tt Row} and {\tt Column}.

%% <<eval=FALSE, echo=TRUE>>=
\begin{verbatim}
> fish2Gal <- read.Galfile(galfile="fish2.txt", 
                           info.id = c("Gene ID", "Gene description"), 
                           layout.id = c(Block="Grid", Row="Row",
	            	   Column="Column"), labels="Gene ID")
\end{verbatim}
%%  @

This function returns a list of 3 components. The first \Robject{gnames}
is an \Robject{marrayInfo} object storing probe annotation information;
the second \Robject{layout} is an \Robject{marrayLayout} object storing
array fabrication (printing) information and lastly a numerical vector
\Robject{neworder} which provides a resorting of data. The probes are
assumed to be ordered and numbered consecutively starting from the top
left grid and the top left spot within each grid.  For most standard
array layout, we typically recommend using this method.\\

{\it Note:} The slot \Robject{maSub} is included to allow importing data
from non-complete arrays. \Robject{maSub} is a "logical" vector
indicating which spots are currently being stored in the slots
containing Cy3 and Cy5 background and foreground fluorescence
intensities.

The {\bf second} method uses both functions {\tt read.marrayLayout} and
{\tt read.marrayInfo} to read and store information on array fabrication
and probe annotation information respectively.  This is usually done for
more complex array structures.  For example, reading in the probe
annotation information for the swirl experiment can done with:

<<eval=TRUE, echo=TRUE>>= 
swirl.gnames <- read.marrayInfo(file.path(datadir, "fish.gal"),
info.id=4:5, labels=5, skip=21) 
summary(swirl.gnames) 
@ 

The following command stores such layout information in the object {\tt
swirl.layout} of class {\tt marrayLayout}.  The location of the control
spots is extracted from the fourth ({\tt ctl.col=4}) column of the file
{\tt fish.gal}.
<<eval=TRUE>>=
swirl.layout <- read.marrayLayout(fname=file.path(datadir, "fish.gal"),
                                  ngr=4, ngc=4, nsr=22, nsc=24,
                                  skip=21,ctl.col=4)
ctl<-rep("Control",maNspots(swirl.layout))
ctl[maControls(swirl.layout)!="control"]  <- "probes"
maControls(swirl.layout)<-factor(ctl)
summary(swirl.layout)
@

\subsection{Reading gene-expression data}
Microarray image processing results are stored in ASCII files and by
default, assumed to be tab-delimited. These can be loaded into R using
\Rfunction{read.marrayRaw} or customized functions like
\Rfunction{read.Spot}, \Rfunction{read.Agilent} and
\Rfunction{read.GenePix} for {\tt Spot}, {\tt Agilent} and {\tt GenePix}
output, respectively. The customized functions are simply ``wrapper''
functions around {\tt read.marrayRaw} which extract relevant spot
statistics for different image processing packages. In addition, these
functions will also setup the probe annotation and array layout
information. The following command illustrate illustrate how to read in
the raw expression data for the swirl data.

<<eval=TRUE, echo=TRUE>>=
mraw <- read.Spot(path=datadir, 
                  layout=galinfo$layout, 
                  gnames=galinfo$gnames, 
                  target=swirlTargets)
summary(mraw)
@

For any arbitrary image analysis output file, we can use the function
{\tt read.marrayRaw}.  The function takes as its main argument a list of
names for files containing the intensity data (e.g. {\tt GenePix} output
files {\tt .gpr}). It also takes as arguments the names of already
created layout, probe, and target description objects, e.g., {\tt
swirl.layout}, {\tt swirl.gnames}, and {\tt swirlTargets} for the Swirl
experiment. The following commands read in all the {\tt Spot} files
residing in the {\tt datadir} directory. The arguments further specify
that the red and green foreground intensities are stored under the
headings {\tt Rmean} and {\tt Gmean}, and that the red and green
background intensities are store under the headings {\tt morphR} and
{\tt morphG}, respectively.

%%<<eval=FALSE>>=
\begin{verbatim}
> fnames <- as.vector(swirlTargets@maInfo[,1])
> swirl.raw <- read.marrayRaw(fnames, path = datadir,
                            name.Gf = "Gmean", name.Gb = "morphG",
                            name.Rf = "Rmean", name.Rb = "morphR",
                            layout = swirl.layout,
                            gnames = swirl.gnames,
                            targets = swirlTargets
                            )
\end{verbatim}
%%@


 \section{Other input functions}

{\bf Widget input functions} \\

To facilitate the creation of microarray
 data objects, each of these three input functions has a corresponding
 {\tt tcltk} widget: {\tt widget.marrayLayout}, {\tt widget.marrayInfo},
 and {\tt widget.marrayRaw}.  A screen--shot of the {\tt marrayRaw}
 widget is shown in Figure \ref{fig:WidgetRaw}; the command to launch
 the widget is as follows (here, {\tt ext} specifies the image output
 file extension):

\begin{verbatim}
>  widget.marrayRaw(path=datadir, ext="spot")
\end{verbatim}


{\bf Wrapper input functions}\\
As mentioned before, for users who prefer command line input for a
specific class of image processing output files, we have defined three
additional functions. The functions {\tt read.Spot}, {\tt read.GenePix},
{\tt read.Agilent} and {\tt read.SMD} automate the creation of {\tt
marrayRaw} objects from {\tt Spot}, {\tt GenePix} and {\tt Agilent}
image analysis files, and from the Stanford Microarray Database (SMD)
raw data files ({\tt .xls}). The main arguments to these functions are a
list of files and the directory path of the files.  The following
commands read two specific files from the {\tt datadir} directory.

%RG: changed to get a build
%% %<<eval=FALSE>>=
\begin{verbatim}
> fnames <- dir(path=datadir,pattern=paste("*", "spot", sep="\."))[1:2]
> swirl <- read.Spot(fnames, path=datadir,
          layout = swirl.layout,
          gnames = swirl.gnames,
          targets = swirl.samples)
\end{verbatim}
%%%@

Alternatively, without specifying any arguments, the functions {\tt
read.spot} and {\tt read.GenePix} by default will read in all {\tt
Spot} or {\tt GenePix} files within a current working directory. One
has the option of setting the layout, probe, and target information
manually at a later stage.

%<<eval=FALSE>>=
\begin{verbatim}
> swirl <- read.Spot()
> test.raw <- read.GenePix()
\end{verbatim}
%@

\begin{figure}[htbp]
\centering
\includegraphics{widget1}
\caption{Screenshot of the widget for creating objects of class {\tt
    marray} from image processing output files.}
    \protect\label{fig:WidgetRaw}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
{\bf Note: Sweave.} This document was generated using the \Rfunction{Sweave}
function from the R \Rpackage{tools} package. The source file is in the
\Rfunction{/inst/doc} directory of the package \Rpackage{marray}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\bibliography{marrayPacks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}