%\VignetteIndexEntry{AgiMicroRna}
%\VignetteKeywords{Preprocessing, Agilent}
%\VignetteDepends{AgiMicroRna}
%\VignettePackage{AgiMicroRna}

\documentclass{article}
\usepackage[latin1]{inputenc}
\usepackage[english]{babel}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\url}[1]{{\textit{#1}}}

\begin{document}

\title{AgiMicroRna}
\author{Pedro Lopez-Romero}
\maketitle

\section{Package Overview}

\Rpackage{AgiMicroRna} provides useful functionality for the processing, quality assessment
and differential expression analysis of Agilent microRNA array data. The package
uses a limma-like structure to generate the processed data in order to 
make statistical inferences about differential expression using the
linear model features implemented in \Rpackage{limma}. Standard Bioconductor 
objects are used so that other packages could be used as well. 

\Rpackage{AgiMicroRna} reads into R \cite{R} the scanned data 
exported by the \texttt{Agilent Feature
Extraction} (AFE) image analysis software \cite{AFE}. Standard graphical
utilities can be used to evaluate the quality of the data. 

\Rpackage{AgiMicroRna} includes a full data example that can be loaded into \texttt{R} in order 
to illustrate the capabilities of the package. The data  
come from human mesenchymal stem cells obtained from bone marrow.  
100 ng of each RNA sample were hybridized onto Agilent
Human microRNA Microarray v2.0 (G4470B, Agilent Technologies).  

The Human microRNA microarray v2.0 contains 723 human and 
76 human viral microRNAs, each of them replicated 16 times. 
There are 362 microRNAs interrogated by 2 different oligonucleotides, 45 microRNAs by 3 and 
390 microRNAs interrogated by 4 different oligonucleotides. Only 2 microRNAs are interrogated by the same 
oligonucleotide. The array contains also a set of positive and negative controls 
that are replicated a different number of times.  

For the statistical analysis we need an estimate of the expression measure for every
microRNA that has to be normalized between arrays. This processed signal 
is going to be used to make statistical inferences about the differential expression. 
In \Rpackage{AgiMicroRna} the
processed microRNA signal can be obtained using two different protocols. 
The first uses the \texttt{Total Gene Signal} (TGS) computed by
the AFE algorithm \cite{AFE} whereas the second obtains an estimate
of the gene signal using the \texttt{RMA} algorithm \cite{RMA}. 

In more detail, the data processing for the first protocol 
is accomplished according to the following sequential steps:
1) Obtaining the Total microRNA Gene Signal processed by AFE, 2) normalization
between arrays. For the RMA algorithm, the steps are slightly different:
1) The signal is background corrected using the exponential + normal
convolution model, 2) the background signal is normalized between arrays, and
3) the total gene signal is estimated from a linear model that takes into
account the probe effect. The estimates of the model are obtained using a 
robust methods such as the median polish.  

After obtaining the normalized total gene signal,
some of the genes are eliminated from the analysis using some of the quality flags 
that AFE attaches to each feature. Finally, the processed signal that is going to be used to make statistical
inferences is stored in a \Robject{ExpressionSet} object \cite{Gentleman}.

The differential expression analysis is accomplished using the linear model
features implemented in \Rpackage{limma} \cite{limma}. A linear model is fitted to each microRNA
gene so that the fold change between different experimental conditions
and their associated standard errors can be estimated. Empirical Bayes methods
are applied to obtain moderated statistics \cite{ebayes}.

\Rpackage{AgiMicroRna} contains different functions to extract useful information from the
objects generated by \Rpackage{limma}. A list of microRNAs with the different
statistics obtained from the differential expression analysis (M value,
moderated t and F statistics, p values and FDR, etc ) is given. In addition,
HTML files that contains links of the declared significant microRNAs to
the Sanger miRBase \url{http://microrna.sanger.ac.uk/} are given. MA plots
highlighting the DEGs are also generated.


\section{Target File}

\Rpackage{AgiMicroRna} has been primarily designed to produce a processed data to be
analyzed using the \Rpackage{limma} package.
First, a target file is needed in order to assign each scanned data file to a given experimental
group. The target file is a tab-delimited text format file \texttt{created by the user}
where we specify the factors that are going to
be included in the statistical model. 
The following columns have to be present in the target file.
A first column \texttt{FileName} is mandatory and includes the image data
files names. A second column \texttt{Treatment} is also mandatory and includes the treatment
effect. The third column,  \texttt{GErep} is also mandatory, and includes the treatment effect
in a numeric code, from 1 to n, being n the number of levels of the treatment effect. Other
columns in the target file are optional. They might included information about
other explanatory variables specifying other experimental conditions, such
age, gender, and blocking variables that take into the account the experimental
design (paired, blocked designed, etc). These variables should be included in the target file for
its eventual use in the limma model.


In the data example provided in \Rpackage{AgiMicroRna} we use microRNAs 
that have been measured in human mesenchymal stem cells obtained
from bone marrow of 2 independent donors. We want to compare 2 treatments MSC\_B and
MSC\_C with a control MSC\_A. For the sake of simplicity we use only 2 
replicates for each experimental condition. 
We define a treatment effect with 3 levels
(A,B and C). We need to specify a \texttt{GErep} variable to specify the treatment levels
using a numeric code, i.e. (1,2,3). In Addition, each treatment has been applied 
to stem cells that have been obtained from the same individuals, 
so we have a randomized blocked (by Subject) design. As we only have
two levels of the blocking variable (subject), this is also known as a paired
design. To consider the paired design in the statistical analysis we
have to add an additional \texttt{Subject} variable in the target file that relates the
individual to its sample. The target file for this example 
is shown in Table 1. 

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|c|c|l|}
\hline \em FileName & \em Treatment & \em GErep & \em Subject \\
\hline
mscA1.txt & A & 1 & 1 \\
mscA2.txt & A & 1 & 2 \\
mscB1.txt & B & 2 & 1 \\
mscB2.txt & B & 2 & 2 \\
mscC1.txt & C & 3 & 1 \\
mscC2.txt & C & 3 & 2 \\
\hline
\end{tabular}
\label{tab:Flag}
\caption{Targe file}
\end{center}
\end{table}

After the user have define the target text file specifying their 
experimental conditions, the  target file can be loaded into \texttt{R} 
using the \Rpackage{AgiMicroRna} function \Rfunction{readTargets}. 

\begin{Sinput}
## NOT RUN 
> library("AgiMicroRna")
> targets.micro=readTargets(infile="targets.txt",verbose=TRUE)
\end{Sinput}

The function \Rfunction{readTargets} returns a \Robject{data.frame}. We can use the
target included in \Rpackage{AgiMicroRna} to describe the microRNA data used to illustrate
the capabilities of the package. 


<<data>>=
library("AgiMicroRna")
data(targets.micro)
print(targets.micro)
@


\section{Reading the data}


We have used microRNA data example coming from human mesenchymal 
stem cells obtained from bone marrow of 2 independent donors.
100 ng of each RNA sample were hybridized onto Agilent
Human microRNA Microarray v2.0 (G4470B, Agilent Technologies) 
The chips were scanned using the Agilent G2567AA Microarray Scanner System
(Agilent Technologies) following manufacturer instructions.
Image analysis and data collection were carried out
using the \texttt{Agilent Feature Extraction 9.1.3.1} (AFE) \cite{AFE}.
with default settings. 

To read the scanned data files into \texttt{R} we use the \Rfunction{readMicroRnaAFE}. 
This function creates an object of a class \Robject{uRNAList}, similar to the \Robject{RGList} object
created by \Rpackage{limma} \cite{limma},  
that includes the variables that we need for the data processing and statistical analysis  
(see Table 2). In particular, the columns "gTotalGeneSignal", "gTotalProbeSignal", "gMeanSignal" and
"gProcessedSignal" loaded from the scanned data files, are stored in the  
following 4 different slots: \Robject{TGS}, \Robject{TPS}, \Robject{meanS} and \Robject{procS}. 
We have created this new class object (adopted from \Rpackage{limma}) to use more appropiate names
for the signal values that we use in AgiMicroRna. 

The \Rfunction{readMicroRnaAFE} calls a new function \Rfunction{read.agiMicroRna},
similar to the \Rfunction{read.maimages} in \Rpackage{limma}. \Rfunction{read.agiMicroRna}  
is internally used as follows:  

\begin{Sinput}
## NOT RUN 
dd=read.agiMicroRna(targets,
             columns=list(TGS="gTotalGeneSignal",
                                 TPS="gTotalProbeSignal",
                                 meanS="gMeanSignal",
                                 procS="gProcessedSignal"),
                   other.columns=list(IsGeneDetected="gIsGeneDetected",
                                        IsSaturated="gIsSaturated",
                                        IsFeatNonUnifOF="gIsFeatNonUnifOL",
                                        IsFeatPopnOL="gIsFeatPopnOL",
                                        BGKmd="gBGMedianSignal",
                                        BGKus="gBGUsed"),
                     annotation = c( "ControlType", "ProbeName","GeneName"),
                     verbose=TRUE)

\end{Sinput}

This implies that in the data files we must have all the columns that are indicated
in the calling to \Rfunction{read.agiMicroRna}. If any of those columns are missing, the 
\Rfunction{readMicroRnaAFE} will produce an error message. In this case, we will have to 
call the  \Rfunction{read.agiMicroRna} by ourselves, omitting those columns that are not 
present in the data files. For the data pre-processing and differential expression
analysis, the columns that we must read at least are: \Robject{gTotalGeneSignal},
\Robject{gMeanSignal}, \Robject{gIsGeneDetected}, \Robject{ControlType}, \Robject{ProbeName},
and \Robject{GeneName}. 

A typical use of \Rfunction{readMicroRnaAFE}  is like: 

\begin{Sinput}
## NOT RUN 
> dd.micro=readMicroRnaAFE(targets.micro,verbose=TRUE)
\end{Sinput}

\Rpackage{AgiMicroRna} contains \Robject{uRNAList} \texttt{dd.micro} that can be used
to explore the capabilities of the package. \texttt{dd.micro} can  
be loaded into \texttt{R} using the \Rfunction{data} command.  

<<data>>=
data(dd.micro)
class(dd.micro)
dim(dd.micro)
@

The variables stored in the \Robject{uRNAList} \texttt{dd.micro} 
are shown in Table 2. 

<<>>=
print(names(dd.micro))
@


\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
\texttt{dd.micro\$TGS} & \texttt{gTotalGeneSignal} \\
\texttt{dd.micro\$TPS} & \texttt{gTotalProbeSignal} \\
\texttt{dd.micro\$meanS} & \texttt{gMeanSignal} \\
\texttt{dd.micro\$procS} & \texttt{gProcessedSignal} \\
\texttt{dd.micro\$targets} & File names \\
\texttt{dd.micro\$genes\$ProbeName} & Probe Name \\
\texttt{dd.micro\$genes\$GeneName} & microRNA Name \\
\texttt{dd.micro\$genes\$ControlType} & FLAG to specify the sort of feature \\
\texttt{dd.micro\$other\$gIsGeneDetected} & FLAG IsGeneDetected \\
\texttt{dd.micro\$other\$gIsSaturated} & FLAG IsSaturated \\
\texttt{dd.micro\$other\$gIsFeatNonUnifOL} & FLAG IsFeatNonUnifOL \\
\texttt{dd.micro\$other\$gIsFeatPopnOL} & FLAG IsFeatPopnOL \\
\texttt{dd.micro\$other\$gBGMedianSignal} & \texttt{gBGMedianSignal} \\
\texttt{dd.micro\$other\$gBGUsed} & \texttt{gBGUsed} \\
\hline
\end{tabular}
\label{tab:Flag1}
\caption{Variables stored in the \texttt{uRNAList} object by \texttt{readMicroRnaAFE}}
\end{center}
\end{table}

The \texttt{ProbeName}  is an Agilent-assigned identifier for the probe synthesized on the
microarray. The \texttt{GeneName}  is an identifier for the gene for which the probe
provides expression information. The target sequence identified by the systematic
name is normally a representative or consensus sequence for the gene.

AFE obtains the \texttt{gTotalGeneSignal} as the \texttt{TotalProbeSignal} 
times the number of probes per gene. This signal can be used in the
differential expression analysis after a possible normalization step. 
The \texttt{gTotalProbeSignal} is the robust average
of all the processed signals for each replicated probe multiplied by the total
number of probe replicates. These signals are used by AFE algorithms to estimate
the \texttt{gTotalGeneSignal}. The \texttt{gMeanSignal} is the raw signal for every probe. These
signals are processed by AFE to obtain the \texttt{gProcessedSignal}. The \texttt{gProcessedSignal}
is obtained after all the AFE processing steps have been completed. Typically it
contains the \texttt{Multiplicatively Detrended BackgroundSubtracted Signal} or the
\texttt{BackgroundSubtractedSignal}. These signals are used by AFE algorithms to estimate
the \texttt{gTotalProbeSignal}. The \texttt{gBGMedianSignal} is the median raw signal of the 
local background calculated from intensities of all inlier pixels that represent
the local background of the feature. The \texttt{gBGUsed} is the background signal computed 
by AFE algorithms. Usually the \texttt{gBGUsed} is the sum of the local background plus 
the spatial detrending surface value computed by the AFE software. The spatial 
detrend surface value estimates the noise due to a systematic gradient on the 
array and it is estimated using the loess algorithm.

AFE attaches to each feature a flag that identifies different quantification
errors of the signal. These quantification flags can be used to filter out
signals that do not reach a minimum established criterion of quality.
\texttt{gIsGeneDetected} is a Boolean variable that informs if the gene was detected on
the microRNA microarray. This flag considers a probe detected if the signal is
three times the error. If one probe of the set of probes comprising a gene is
detected, the gene is considered detected,(1 = IsDetected  0 = IsNotDetected).
\texttt{gIsSaturated} is Boolean flag. 1 indicates that the feature is saturated, i.e.
at least half of the inlier pixels in the feature have intensities above the
saturation threshold. gIsFeatureNonUnifOL is Boolean flag. 1 indicates that the
feature is a non-uniformity outlier; the measured feature pixel variance is
greater than the expected feature pixel variance plus the confidence interval.
\texttt{gIsFeatPopOL} is Boolean flag. 1 indicates that the feature is a population
outlier.

Finally, the \texttt{dd.micro\$targets} contain the name of the files equal to those in
target file. 


\section{Plotting Functions}

\Rpackage{AgiMicroRna} includes functions to create boxplots, density plots, MA plots, 
\texttt{Relative Log Expression} (RLE) \cite{Bold3},
and hierarchical cluster of samples that can assist the user in assessing the quality 
of the data as well as in checking the performance of the processing steps.

All these graphical utilites are included in the \Rfunction{qcPlots} wrapper function.   
\Rfunction{qcPlots} can be called using different intensity signals. For the 
\texttt{gMeanSignal}, the  boxplots, density plots, MA plots, RLE plots
and hierachical clustering plots are produced. For the \texttt{gProcessedSignal} the same
plots are done, except the hierarchical clustering. For the \texttt{gTotalProbeSignal}
and the \texttt{gTotalGeneSignal} only the  boxplots and density plots are done, and
finally, for the background signals only the boxplots are done.


The MA plots represent the fold-change (M) in the y-axis against the average log expression (A) for 
two given arrays. To reduce the number of pairwise comparison MA plots displayed, 
\Rfunction{qcPlots} uses a reference array to which the rest of arrays are compared. The signal 
values of the reference array are computed as the median spots taken over the whole set of arrays
Every kind of feature is identified with different color.
of feature is identified with different color.


The RLE plot displays for each sample a Boxplot with the \texttt{Relative
Log Expression} (RLE) \cite{Bold3}. 
The RLE is computed for every spot in the array as the difference between 
the spot and the median of the same spot across all the arrays. If the majority of 
the spots are expected not to be differentially expressed, the boxplots should be 
centered on zero and all of them with approximately the same dispersion. 


\Rfunction{qcPlots} makes a hierarchical cluster of the samples using
the \Rfunction{hclust} function of the \Rpackage{stats} package.
We can make a hierarchical clustering of samples either using the whole set of genes or 
using for instance only the 100 genes that show the highest variability across arrays.
The options for the distance measures are \texttt{euclidean} and \texttt{pearson}. 
The variables that distinguish the experimental conditions from one another are the
differential expressed genes, and that the number of genes may be few relative to the 
full set of genes of the data set, and hence the cluster analysis will often 
not reflect the influence of these relevant genes. Therefore if the percentage of differential 
expression is low, the samples might not be grouped according 
to their experimental group, since the whole set of genes has very little information 
regarding the experimental grouping, and the plot will mainly show other 
grouping features or simply random noise. 
 

A typical call to the \Rfunction{qcPlots} using the Mean Signal intensity is like: 

\begin{center}
<<qcPlots,eval=TRUE>>=
qcPlots(dd.micro,offset=5,
        MeanSignal=TRUE,
        ProcessedSignal=FALSE,
        TotalProbeSignal=FALSE,
        TotalGeneSignal=FALSE,
        BGMedianSignal=FALSE,
        BGUsed=FALSE,
        targets.micro)

@
\end{center}


The same plots can be also generated calling the corresponding 
functions individually. Next we show how to use these functions 
using the \texttt{gMeanSignal}. Recall, that the \texttt{gMeanSignal} is 
stored in \texttt{dd.micro\$meanS}. 

\begin{center}
<<BoxPlot,fig=TRUE>>=
boxplotMicroRna(log2(dd.micro$meanS),
	maintitle='log2 Mean Signal',
	colorfill= 'orange')
@
\end{center}
\clearpage 

\begin{center}
<<plotDensity,fig=TRUE>>=
plotDensityMicroRna(log2(dd.micro$meanS),
	maintitle='log2 Mean Signal')
@
\end{center}
\clearpage 


\begin{center}
<<MA,fig=TRUE>>=
ddaux=dd.micro
ddaux$G=log2(dd.micro$meanS)
mvaMicroRna(ddaux,
	maintitle='log2 Mean Signal',
	verbose=FALSE)
rm(ddaux)

@
\end{center}
\clearpage 


\begin{center}
<<RLE,fig=TRUE>>=
RleMicroRna(log2(dd.micro$meanS),
	maintitle='log2 Mean Signal - RLE')
@
\end{center}
\clearpage 


\begin{center}
<<hclust,fig=TRUE>>=
hierclusMicroRna(log2(dd.micro$meanS),targets.micro$GErep,
		methdis="euclidean",
                methclu="complete",
		sel=TRUE,100)
@
\end{center}



\section{Array reproducibility}

In the Agilent microRNA platforms 
normally each microRNA gene is interrogated
by 16 probes either using 2 different probes, each
of them replicated 8 times, or using 4 different probes replicated 4 times.
The probe level replication allows the computation of the 
coefficient of variation (CV) for each array. 

The \Rfunction{cvArray} identifies the replicated non-controlprobes for each feature in 
the array and computes CV for every microRNA probe set. Then, the median of the CV for each 
probe set is reported as the array reproducibility. A lower CV median indicates
a better array reproducibility. 

A set of boxplots shows the CV at a probe level for each array. 

To obtain the CV using the \Rfunction{cvArray} function, we can either choose the 
MeanSignal or ProcessedSignal.

<<cvArray,fig=TRUE>>=
cvArray(dd.micro,
        "MeanSignal",
        targets.micro,
	verbose=TRUE)

@


\section{Total Gene Signal}

Normally, in the Agilent platfomrs, each microRNA is interrogated by 
16 probes either using 2 different probes, each
of them replicated 8 times, or using 4 different probes replicated 4 times. 
In \Rpackage{AgiMicroRna} the processed gene signal that is going to 
be analyzed can be obtained using two different protocols. 
The first uses the \texttt{gTotalGeneSignal} (TGS) computed by the AFE algorithm \cite{AFE}
and stored in the \Robject{uRNAList} (\texttt{dd.micro\$TGS}) after reading the data. 
The second option obtains an estimate of the normalized gene signal using the
RMA algorithm \cite{RMA}. The RMA method is explained in the next section.

The function \Rfunction{tgsMicroRna} creates an \Robject{uRNAList} containing the
\texttt{gTotalGeneSignal} processed by the AFE software. This signal might be used for making 
statistical inferences after a possible normalization step. 
The TGS processed by AFE \cite{AFE} may contain negative values. To obtain
signals with positive values we can either add a positive small constant (offset) to all
the signals  or we can select the half option in
\Rfunction{tgsMicroRna}, which set to 0.5 all the values smaller than
0.5. To use the \texttt{offset} option we have to set \texttt{half=FALSE}, otherwise the half
method is used by default. The offset option, adds to each signal the quantity
\Robject{(abs( min(ddTGS\$TGS)) + offset)}, where \Robject{ddTGS\$TGS} is the matrix that contains the
\texttt{gTotalGeneSignal}.

<<tgsMicroRna,eval=TRUE>>=
ddTGS=tgsMicroRna(dd.micro,
    half=TRUE,
    makePLOT=FALSE,
    verbose=FALSE)

@

Finally, \Rfunction{tgsMicroRna} stores the estimated TGS in the four fields \Robject{ddTGS\$TGS} 
\Robject{ddTGS\$TPS}, \Robject{ddTGS\$meanS} and \Robject{ddTGS\$procS}. Be aware that 
this TGS is not in log2 scale. 

\section{Normalization between arrays}

To make direct comparisons of data coming from different chips it is important
to remove sources of variation of non biological nature that may exists between
arrays. Systematic non-biological differences between chips become apparent in
several obvious ways especially in labeling and in hybridization, and bias the
relative measures on any two chips when we want to quantify the differences in
treatment of two samples. Normalization is the attempt to compensate for
systematic technical differences between chips, to see more clearly the
biological differences between samples.

\Rpackage{AgiMicroRna} uses two strategies to obtain a gene signal 
estimate normalized between arrays. The first simply uses the TGS
signal processed by the AFE algorithms \cite{AFE} as it was described in
the last section. This TGS can be normalized between arrays using   
either the \texttt{quantile} (default) \cite{quant1},\cite{quant2} or the \texttt{scale} methods. 
\Rpackage{AgiMicroRna} incorporates the \Rpackage{limma} function \Rfunction{normalizeBetweenArrays}
with options ('none','quantile','scale') \cite{limma}, \cite{Norm}. If we use the  
\texttt{none} option the TGS is only log2 transformed. 


<<tgsNormalization,eval=TRUE>>=
ddNORM=tgsNormalization(ddTGS,
                "quantile",
                makePLOTpre=FALSE,
                makePLOTpost=FALSE,
                targets.micro,
		verbose=TRUE)
@

\Rfunction{tgsNormalization} creates an \Robject{uRNAList} object containing the 
normalized total gene signal in log 2 scale.  
If we set makePLOTpre = TRUE and makePLOTpost = TRUE, a density
plot, a boxplot, and MA plot, a
Relative Log Expression plot (RLE) \cite{Bold3} and a hierarchical clustering
plot are constructed using the data before and after normalization, respectively.

The second alternative implemented in \Rpackage{AgiMicroRna} to obtain an estimate
of the normalized signal for each microRNA uses the robust multiarray
average (RMA) method \cite{RMA} implemented in the \Rpackage{affy} package \cite{affy}. 
RMA obtains an estimate of the expression measure for each gene using all the probe measures 
for that gene. First, the signal is background corrected (optional) by fitting a normal + exponential 
convolution model to the vector of observed intensities. The normal part represents
the background and the exponential part represents the signal intensities \cite{RMA}. Then the 
arrays are normalized (optional) using the \texttt{quantile} normalization \cite{quant1} ,\cite{quant2}. Finally,
an estimate of the microRNA gene signal is obtained fitting a linear model to , 
log2 transformed probe intensities. This model produces an estimate
of the gene signal taking into account the probe effect. The model parameters estimates are obtained
by the median polish algorithm.  

The \Rfunction{rmaMicroRna} is a wrapper function that incorporates
different function to obtain an RMA estimate for the signal of each microRNA. 
First, the \Rfunction{rmaMicroRna} 
the signal can be background corrected using the \Rfunction{rma.background.correct} function 
of the \Rpackage{preprocessCore} \cite{preprocessCore}, then the signal may be normalized between 
arrays using the \Rpackage{limma} function \Rfunction{normalizeBetweenArrays} \cite{limma}, \cite{Norm},
with the \texttt{quantile} method \cite{quant1} ,\cite{quant2}. Then, the median of the replicated probes
is obtained, leading to either 2 or 4 different measures for each gene. These measures correspond to different probes
measure for the same gene. These probe measure intensities are log2 transformed and then  
summarized into a single gene measure 
by the \Rfunction{rma\_c\_complete\_copy} of the \Rpackage{affy} package \cite{affy}.

<<rmaMicroRna,eval=FALSE>>=
ddTGS.rma=rmaMicroRna(dd.micro,
		normalize=TRUE,
		background=TRUE)

@

Finally, \Rfunction{rmaMicroRna} stores the estimated RMA signal in the four fields \Robject{ddTGS.rma\$TGS} 
\Robject{ddTGS.rma\$TPS}, \Robject{ddTGS.rma\$meanS} and \Robject{ddTGS.rma\$procS}. Be aware that 
this estimated signal estimated by function{rmaMicroRna} is now in log2 scale.  
To get some plots with the gene signal processed by RMA, we can use
the code of the following example. 

\begin{Sinput}
> MMM=ddTGS.rma$meanS 
> colnames(MMM)=colnames(dd.micro$meanS)
> maintitle='TGS.rma'
> colorfill='blue'
> ddaux=ddTGS.rma
> ddaux$meanS=MMM	   
> mvaMicroRna(ddaux,maintitle,verbose=TRUE)
> rm(ddaux)
> RleMicroRna(MMM,"RLE TGS.rma",colorfill)
> boxplotMicroRna(MMM,maintitle,colorfill)
> plotDensityMicroRna(MMM,maintitle)
\end{Sinput}

In the code above, be aware that \Rfunction{mvaMicroRna} plots by default whatever is 
contained in \Robject{ddaux\$meanS}, so first, we create this \Robject{ddaux} object
and then, we stored in \Robject{ddaux\$meanS} the matrix we want to use. 
After the total gene signal have been obtained, either extracting the TGS produced 
by AFE, or using the RMA algorithm, we can filter out the signals using the
FLAGS attached to them by the AFE algorithm. 

\section{Filtering Probes}

The AFE attach to each feature a flag that identifies different quantification
errors of the signal that can be used to filter out the microRNAs that do not reach
a minimum of quality. This filtering is normally done after the Total Gene
Signal has been normalized. Different filtering criteria can be established to
be more or less demanding. With low number of replicates probably we need to set
more restrictive filtering limits.

The steps that are currently implemented in \Rfunction{filterMicroRna} are to 
remove control features, to remove gene that are not detected (gIsGeneDetected = 0), 
and to filter out microRNAs which expression do not reach a certain level based on the 
expression of the negative controls.

Basically, the gIsGeneDetected filtering removes microRNAs that are not expressed
in any experimental condition. To do that, for a given feature xi across samples,
we set limit that is the minimum \% of features that they must remain in at least
one experimental condition with a gIsGeneDetected-FLAG = 1 (Is Detected).

Additionally, if we want to be more demanding, we can remove the microRNAs whose
\Robject{gMeanSignal} expression are close to the expression of the negative control features. As
before se set a limit for the \% of microRNAs that they must above a Limit expression
value, in at least, one of the experimental conditions. The Limit expression value
is defined as Mean(negative control expression)  + 1.5 x SD(negative control
expression)

The function returns a \Robject{uRNAList} containing the FILTERED data. In order to allow
the tracking of features that may have been filtered out from the original raw
data, the following files are given:

\texttt{NOCtrl\_FlagIsGeneDetected.txt}: IsGeneDetected Flag for the Non Control Genes, 1 = detected
\texttt{IsNOTGeneDetected.txt}: Genes that do not are not detected according to IsGeneDetected Flag

To continue the with the illustration we use the total gene signal estimated by 
the Agilent Feature Extraction software and normalized by \texttt{quantile} (\texttt{ddNORM}). 
We could have use the \texttt{ddTGS.rma} obtained by RMA as well.  

<<filterMicroRna, eval=TRUE>>=

ddPROC=filterMicroRna(ddNORM,
                    dd.micro,
                    control=TRUE,
                    IsGeneDetected=TRUE,
                    wellaboveNEG=FALSE,
                    limIsGeneDetected=75,
                    limNEG=25,
                    makePLOT=FALSE,
                    targets.micro,
		    verbose=TRUE,
                    writeout=FALSE)
@

\section{creating an ExpressionSet object}

After all the processing steps the \Rfunction{esetMicroRna} creates an
\Robject{ExpressionSet} object  \cite{Gentleman} that can be used for the statistical
analysis. If we set makePLOT=TRUE some plots are displayed (Heatmaps, PCAs)

<<esetMicroRna, eval=TRUE>>=
esetPROC=esetMicroRna(ddPROC,
          targets.micro,
          makePLOT=FALSE,
	  verbose=TRUE)
@


The processed data can be written in an output file using the function
\Rfunction{writeEset}


<<writeEset, eval=FALSE>>=
writeEset(esetPROC,
            ddPROC,
            targets.micro,
	    verbose=TRUE)
@


\section{Differential expression analysis using LIMMA}

The esetPROC contains the Total Gene Signal for every microRNA that have passed
the filtering process, basically, those microRNAs that are expressed in at least
one of our experimental conditions. This signals are used to look for the
microRNAs that are differentially expressed between our experimental conditions.
A linear model is fitted to each microRNA gene so that the fold change between
different experimental conditions and their standard errors can be estimated.
Empirical Bayes methods are applied to obtain moderated statistics. The
functions of the \Rpackage{limma} \cite{limma} are used to that end.


\subsection{Linear Model}

In our case, we had a paired design (by subject) and we want to compare
treatment B and C vs. a control treatment A. The linear model that we are going
to fit to every microRNA is going to be  \texttt{y = Treatment + Subject + error term}.
This model is going to estimate the treatment effect. Then, the comparison of
interest are done by establishing contrasts between the estimates of the
treatment effects.

First, we define the factors that we are going to use in our model formula:
treatment and subject. We use the structure defined in the \texttt{targets.micro} data
frame. One way of doing this could be this way:

<<>>=
levels.treatment=levels(factor(targets.micro$Treatment))
        treatment=factor(as.character(targets.micro$Treatment),
                levels=levels.treatment)

@
<<>>=
levels.subject=levels(factor(targets.micro$Subject))
        subject=factor(as.character(targets.micro$Subject),
                levels=levels.subject)
@

\subsection{Fitting the model. Design and Contrast Matrices}

To fit the model, we  need to define a \texttt{design matrix}. The \texttt{design matrix} is an
incidence matrix that relates each array/sample/file to its given experimental
conditions, in our case, relates each file to one of the three treatments and
to the subjects (paired design by subject). We can define the \texttt{design matrix} using
\Rfunction{model.matrix}\texttt{(~ -1 + treatment + subject)}. This will specify a model without intercept
that will give us the estimates for: treatmentA, treatmentB, treatmentC and subject2.
\texttt{R} use by default the \texttt{contr.treatment} parameterization, which means that each level
is compared with its baseline level (which is set to 0 to avoid singularities).
If we specify a model without intercept, no singularities are produced for
the treatment factor, so we get estimates for all its levels. For the subject
factor (two levels), we only get an estimate for the subject 2, which is interpreted as the
difference with respect to subject1 (baseline = 0).


We can specify the design matrix as:

<<>>=
design=model.matrix(~ -1 + treatment + subject)
print(design)
@

Then the linear model can be fitted using \Rfunction{lmFit} from \Rpackage{limma}
\cite{limma}. This will get the treatment estimates for each microRNA in the
\Robject{ExpressionSet} object.

<<>>=
fit=lmFit(esetPROC,design)
names(fit)
print(head(fit$coeff))
@

To compare A vs. B and A vs. C, we can define the contrasts of interest using a
contrast matrix as in

<<>>=
CM=cbind(MSC_AvsMSC_B=c(1,-1,0,0),
        MSC_AvsMSC_C=c(1,0,-1,0))
print(CM)
@

And then, we can estimate those contrasts using \Rfunction{contrasts.fit} from
\Rpackage{limma} \cite{limma}.


<<>>=
fit2=contrasts.fit(fit,CM)
names(fit2)
print(head(fit2$coeff))
@

Finally, we can obtain moderated statistics using \Rfunction{eBayes} from
\Rpackage{limma} \cite{limma}, \cite{ebayes}.

<<>>=
fit2=eBayes(fit2)
names(fit2)
@

The function \Rfunction{basicLimma} implemented in \Rpackage{AgiMicroRna} gathers all the
last steps in a function to produces the last \Robject{MarrayLM} \Robject{fit2} object.

This \Robject{MarrayLM} object includes, amongst other things, the log fold change of the
contrasts (M value stored in \Robject{fit2\$coeff}), along with its moderated-t statistic
(stored in \Robject{fit\$t}) and its corresponding p value (in \Robject{fit2\$p.value}).
Be aware that to obtain correct inferences these p values must be corrected 
by some multiple testing procedure.

<<basicLimma, eval=FALSE>>=
fit2=basicLimma(esetPROC,design,CM,verbose=TRUE)
@

\subsection{Selecting significant microRNAs}

\Rfunction{getDecideTests} uses the \Rfunction{decideTests} function from
the \Rpackage{limma} package \cite{limma} to classify the list of genes as up,
down or not significant, taking into account the multiplicity of the tests.
\Rfunction{getDecideTests} summarizes the number of up and down genes for
every contrasts according to the specified p value threshold. When we have
multiple contrasts, the significant genes can be selected using different strategies
that are specified by \texttt{DEmethod}. In \Rfunction{getDecideTests} only
\texttt{separate} and \texttt{nestedF} options are implemented. See \Rfunction{decideTests} in
\Rpackage{limma} \cite{limma}


<<getDecideTests, eval=TRUE>>=
DE=getDecideTests(fit2,
        DEmethod="separate",
        MTestmethod="BH",
        PVcut=0.10,
	verbose=TRUE)
@

\Rfunction{pvalHistogram} depicts a histogram of the p values.
For multiple contrasts, the function creates a histogram for every t.test p value (separate)
or a single histogram for the \texttt{F.test pvalue} (\texttt{nestedF}). The histograms of the
pvalues give us an idea about the amount of differential expression that we
have in our data. A uniform histogram will indicate no differential expression
in the data set, whereas a right skewed histogram, will indicate some significant
differential expression

<<pvalHistogram,eval=FALSE>>=
pvalHistogram(fit2,
        DE,
        PVcut=0.10,
        DEmethod="separate",
        MTestmethod="BH",
        CM,
	verbose=TRUE)
@

The function \Rfunction{significantMicroRna} summarizes the results of
the differential expression analysis extracting information from the
\Robject{MArrayLM} and \Robject{TestResults} objects generated by the \Rpackage{limma}
functions. \Rfunction{significantMicroRna} creates a list containing the
genes with their relevant statistics. The significant genes above the \texttt{PVcut}
p values are also given in a html file that links the selected microRNAs to the
\texttt{miRBase} \url{http://microrna.sanger.ac.uk/}. MA plots highlighting the
differentially expressed genes are also displayed.

<<significantMicroRna, eval=FALSE>>=
significantMicroRna(esetPROC,
        ddPROC,
        targets.micro,
        fit2,
        CM,
        DE,
        DEmethod="separate",
        MTestmethod="BH",
        PVcut=0.10,
        Mcut=0,
	verbose=TRUE)
@

When multiple contrasts are done, the method for the selection of the significant genes
can be either \texttt{separated} or \texttt{nestedF}. See \Rfunction{decideTests} in
\Rpackage{limma} \cite{limma} for a detailed description on these two methods.
When \texttt{separated} is plugged in into the \Rfunction{significantMicroRna} function
then a list including all the genes that have been analyzed is given for each contrast.  
These lists contain the statistics obtained from the differential expression analysis.  
The information that is given for each gene is shown in Table 3. 

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
PROBE & Probe name (one of the probes interrogating the gene) \\
GENE & microRNA name \\
M & Fold change \\
A & Mean of the intensity for that microRNA \\
t & moderated t-statistic \\
pval & p value of the t-statistic \\
adj.pval & p value adjusted by \texttt{MTestmethod} \\
fdr.pval & p value adjusted by fdr \\
\hline
\end{tabular}
\label{tab:Flag3}
\caption{Statistics given by \texttt{significantMicroRna} for \texttt{separate}}
\end{center}
\end{table}


The html output files with links to the \texttt{miRBase} \url{http://microrna.sanger.ac.uk/}
only includes the microRNAs that have been declared as significant. 
Sometimes when we correct by multiple testing we cannot declare any single gene 
as differentially expressed so the html files are not created. Despite of the fact that
no gene is been declared significant, the user might be interested in having a link to the
\texttt{miRBase} for the top ranked differentially expressed genes. When this happens,  
we can be set \texttt{MTestmethod = none}. Since the \texttt{adj.pval} is the p value adjusted
by the \texttt{MTestmethod = none}, then it will be the p value without any correction. 
in this case, it might be interesting to have the fdr value, despite of the fact that the user has
decided not apply any multiple testing correction. That is why, an additional column \texttt{fdr.pval}
is included containing the p values corrected by fdr, independent of the method that
we have specify in \texttt{MTestmethod}. Be aware that a multiple testing correction
should be used.  

If the \texttt{nestedF} is used, then two lists are printed out for each contrast. A first
set of lists containing the selected significant genes, and a second group of lists containing the rest
of the genes that have been analyzed and they do not have been declared significant.
The columns given in this case in the output files are shown in Table 4. 

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
PROBE & Probe name (one of the probes interrogating the gene) \\
GENE & microRNA name \\
M & Fold change \\
A & Mean of the intensity for that microRNA \\
t & moderated t-statistic \\
F & F statistic (null hypothesis: Ci = Cj, for all contrasts i, j) \\
t pval & p value of the t-statistic \\
adj.F.pval & F p value adjusted by \texttt{MTestmethod} \\
fdr.F.pval & F p value adjusted by fdr \\
\hline
\end{tabular}
\label{tab:Flag4}
\caption{Statistics given by \texttt{significantMicroRna} for \texttt{nestedF}}
\end{center}
\end{table}

\clearpage 
\bibliographystyle{plain}
\bibliography{AgiMicroRna}
\end{document}