%\VignetteIndexEntry{ABarray gene expression} %\VignetteKeywords{Applied Biosystems AB1700} %\VignetteDepends{ABarray} %\VignettePackage{ABarray} \documentclass[10pt]{article} \title{AB1700 Microarray Data Analysis} \author{ Yongming Andrew Sun, Applied Biosystems \\ sunya@appliedbiosystems.com } \usepackage{graphicx} \usepackage{fancyhdr} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\textit{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \oddsidemargin -0.2in \evensidemargin 0.0in \textwidth 6.9in \topmargin -0.75in \textheight 9.25in \pagestyle{fancy} \rhead{AB1700 Gene Expression} \rfoot{ABarray package} \begin{document} \setkeys{Gin}{width=0.6\textwidth} \maketitle \tableofcontents \section{ABarray Package Introduction} The package \Rpackage(ABarray) is designed to work with Applied Biosystems whole genome microarray platform, as well as any other platform whose data can be tranformed into expression data matrix. In the following functions described, items 1 and 2 are specific to AB1700 datasets. Items 3 to 9 can be applied to any datasets. However, for AB1700 data, filtering is performed using S/N ratio threshold (default = 3), while filtering is not applied to other data. The package \Rpackage(ABarray) will perform the following functions: \begin{enumerate} \item Read output from AB1700 software output (AB1700 specific) \item Perform analysis on hybridization control spike-ins (AB1700 specific) \item Perform raw data QA and associated plots (see \Rfunction{doPlotEset}) including: \begin{itemize} \item boxplot for signal distrubution range \item MA plot for signal distribution and signal variability \item CV plot for variation among hybridization replicates \item Scatter plot for correlation between hybridization arrays \item Correlation heatmap for visualization \item S/N detection concordance \end{itemize} \item Perform quantile normalization \item Repeat data QA after normalization. See \Rfunction{doPlotEset} \item Perform t test, fold change and ANOVA and produce graphics to visualize t test results. The default t test assumes unequal variances. See \Rfunction{doPlotFCT} for more details. \item Peform LPE for low number of replicates. See \Rfunction{doLPE} for more details \item For more details on ANOVA analysis, see \Rfunction{doANOVA} \item For drawing Venn diagram, see \Rfunction{doVennDiagram} \end{enumerate} \subsection{Required Files and Format} To take full advantage of automated process provided by package \Rpackage{ABarray}, two files should be available before running \Rpackage{ABarray}: a data file and an experiment design file. \begin{itemize} \item Experiment Design File. The rows of the file are samples or arrays. The first column should be sampleName. Perhaps, sampleName should be concise and no spaces between characters. Additional columns maybe assayName and arrayName (one of these). Additional columns should specify what type of samples. Note: It is best to have assayName the same as in dataFile. See an example of experiment design file (Table~\ref{tab:design1}). \begin{table}[h] \begin{center} \caption{Experiment design and sample information.} \label{tab:design1} \begin{tabular}{rlll} \hline & sampleName & assayName & tissue \\ \hline 1 & A1 & HB003U2\_9/2/05\_1:55\_PM & TissueA \\ 2 & A2 & HB003U3\_9/2/05\_2:11\_PM & TissueA \\ 3 & A3 & HB003U8\_9/2/05\_2:26\_PM & TissueA \\ 4 & B1 & HB003TV\_9/2/05\_2:41\_PM & TissueB \\ 5 & B2 & HB003TW\_9/2/05\_2:55\_PM & TissueB \\ 6 & B3 & HB003TZ\_9/2/05\_3:10\_PM & TissueB \\ \hline \end{tabular}\end{center} \end{table} In the example (Table~\ref{tab:design1}), assayName is included and arrayName is not. There is no need for both assayName and arrayName to be listed in the design file. Note: every column name in the header is ONE word, there needs NO spaces in each word. Make sure sampleName, assayName, arrayName spelled correctly. There is no need to worry about lower case or upper case. \item Data File. The rows of the file represent probes. The first column is probeID (or Probe Name), the second column is geneID, next set of columns should contain Signal, S/N and FLAGS for all samples. Optionally columns with Assay\_Normalized\_Signal, SDEV, CV can be included in the data file, but these columns will be ignored in the analysis. The columns are: probeID, geneID, Signal, S/N, Flags, and optionally SDEV, CV, and Assay\_Normalized\_Signal. \item Control Probes. It is optional to have control probes. If they are present, plots will be generated for the control probes, but they will be removed for further analysis. \end{itemize} \subsection{Data File and Experiment Design File} The data file and experiment design file are essential. They should be either tab-delimt txt file or comma-delimit csv file. It is very important to make these files compliant. If array name in the header of data file is present and arrayName is defined in experiment design file, arrayName will be used to distinguish which column is for which hybridization sample. In the absence of arrayName in experiment design file, assayName will be used to identify which column in data file is for which hybridization sample. It is important to spell assayName in the experiment design file the same as in data file. For example, if the assayName in data file is called \Rfunarg{HA004J7\_1A\_2}, which will show as \Rfunarg{Signal HA004J7\_1A\_2} in data file, the assayName in experiment design file should be called exactly as \Rfunarg{HA004J7\_1A\_2}. Note: assayName should be unique and not part of another assayName. For example, assayName S1 is part of another assayName S11. In this case, S1 is better renamed to S01. Sometimes, there could be more assays in data file than we actually want to analyze. For example, we decided to ignore one or two of the arrays, because the images are bad and there are reasons to exlude these arrays. But instead to delete the related columns in data file, we simply change the experimental design file to just include those arrays we need to analyze. However, on the other hand, if there are more hybridization samples in experiment design file than in data file, it will produce an error, as this does not make sense. \subsection{ABarray Function Details} \subsubsection{Processing of control probes} There are a number of internal control probes for AB1700 Expression Array Systems. The QC analysis is performed on the following controls before any data normalization. \begin{enumerate} \item Hybridization Controls. \item IVT Labeling Controls. \item RT Labeling Controls. \item Negative Controls. \end{enumerate} All other controls are ignored in the analysis. The behavior of signals of the controls are plotted in the figures. The control probes and associated measurements are removed before data normalization and transformation. \subsubsection{Data normalization and transformation} Once the control probes and associated measurements are removed, the quantile normalization procedure is applied. The data is then transformed into log2 scale. If the imputation is selected (default is to use average imputation), the missing value imputation is applied after quantile normalization and log2 transformation. What is considered missing value? If the \textit{Flag} value of a probe is above 5000 (actually above $2^{12}= 4096$), the signal value of the probe is considered missing value and is replaced with NA. The imputation is then performed on these NA values. If 'No imputation' is used, the probe signal values are used as is (they are not replaced with NA, nor are they imputed) in the downstream analysis. There are 2 choices in the program for imputation. \begin{enumerate} \item No imputation. Imputation is not performed, and the signal values are used as is. \item Average imputation. The probe signal values are averaged from other arrays in the same subgroup. The missing values are replaced with the averaged probe. If there is not enough values to average for certain probes, the signal values of these probes remain missing (NA). \end{enumerate} \subsubsection{Array QC Analysis} A number of array QC analysis is performed and associated figures are generated. The QC analysis is performed on both the pre-normalized data and normalized data. \begin{itemize} \item boxplot \item MA plot \item scatter plot \item correlation plot \item CV plot \item percentage of probes detected \item probe detection concordance \end{itemize} \subsubsection{Statistical Analysis for Differential Expression} A default two sample \textit{t} test is performed between any 2 subgroups. A probe filtering procedure precedes the \textit{t} test. If a probe is not detected (S/N < 3) in both subgroups, it is not included in the \textit{t} test. If a probe is not detected in more than 50\% of samples in a subgroup, it is not detected in that subgroup. The 50\% is default, and can be set to a different threshold value. The fold change (FC) and log2(FC) for the detected probes are also calculated. The FC is divided into 5 FC bins. In addition, the \textit{p} values from the \textit{t} test are used to calculate adjusted \textit{p} value as FDR using Biocondutor package \Rpackage{multtest}. The FDR is calculated using Benjamini-Hochberg procedure. The results are written into files and plotted into MA and volcano plots. If there are more than 2 subgroups, ANOVA will also be performed. The probe detection threshold (same as \textit{t} test) is also applied. \subsubsection{Clustering heatmap} The clustering heatmap is also produced if the number of differentially expressed probes is less than 800. \subsubsection{Description of Results} There are 3 folders generated after the analysis for each group. \begin{itemize} \item DataResult. This folder contains several text files. \begin{itemize} \item 'ControlRawData.csv' contains signals of control probes. \item 'ProbeImputed\_' contains probeID whose expression measurement may be imputed. \item 'QuantileNormalizedExpression' contains quantile normalized, log2 transformed, and imputed values. \item 'RawExpressionData.csv' contains non-normalized, non-imputed raw measurement. \item 'ST\_DetectPct' files contain statistical analysis results. \end{itemize} \item jpgFigure. This folder contains figures in jpg or bitmap format. \item pdfFigure. This folder contains figures in pdf format. \end{itemize} \section{ABarray Function Demonstration} After loading the data file and experiment design file, the ABarray functions check to see if control signals are present in data file. If they are present, several plots for control assesement will be produced. The program then removes these control probes and perform QA analysis on raw data. It also produces a number of plots to visualize those QA assesement. After this, it performs quantile normalization and repeat QA assessment for quantile normalized data. Futhermore, fold changes between subgroups and \textit{t} test to identify differential expressed genes will be performed (and ANOVA if there are more than 2 subgroups in the group). \subsection{Required Libraries} The bioconductor library \Rpackage{Biobase} and \Rpackage{multtest} are required to run ABarray. In addition, several other optional libraries may be needed to perform various functions. \Rpackage{impute} is needed if KNN imputation will be performed, \Rpackage{limma} is needed if drawing Venn diagram function is performed. \subsection{Preparing R for Data Analysis} After launching R, we need to set the R working directory to inform R where we find data file and design file. It also inform R where to store the analysis results and associated figures from the analysis. To set R working directory, choose \textsf{'Change dir...'} from \textsf{'File'} menu, and a dialog box appears to let you browse which directory (folder) to set as working directory (See Figure~\ref{fig:setWorkDir1} and Figure~\ref{fig:setWorkDir2}). \begin{figure}[htp] \begin{minipage}{0.45\textwidth} \includegraphics[width=1\textwidth]{Figure/dirChangeFig.pdf} \caption{Change R working directory. Select Change dir... from File menu.} \label{fig:setWorkDir1} \end{minipage} \hfill \begin{minipage}{0.45\textwidth} \includegraphics[width=1\textwidth]{Figure/dirSelectFig.pdf} \caption{Browse the directory to set as working directory.} \label{fig:setWorkDir2} \end{minipage} \end{figure} \subsection{File List in Working Directory} To see the content of the working directory, type the following command \begin{Schunk} \begin{Sinput} > dir() \end{Sinput} \begin{Soutput} [1] "ABarray_Example.tex" "ExperimentDesign.csv" "RawExpressionABData.txt" \end{Soutput} \end{Schunk} \subsection{Loading ABarray library} The \Rpackage{ABarray} has to be loaded to perform the automated analysis. To load \Rpackage{ABarray}, type the following command after launching R. \begin{Schunk} \begin{Sinput} > library(ABarray) \end{Sinput} \end{Schunk} \subsection{Performing Automated Expression Analysis} The automated gene expression analysis requires only one command. The function takes 3 parameter arguments. The first argument is the name of the data file. The second argument is the name of the expriment design file. The third argument is the name of the experiment group defined in the experiment design file on which t test and other analysis should be performed. It will generate an \Robject{ExpressionSet} expression value object. In the following example, the name of the data file is called 'RawExpressionABData.txt'; the name of the experiment design file is called 'ExperimentDesign.csv'; and the 'tissue' is the group name selected to perform analysis on. The expression value object is assigned to \Robject{eset} object. Since the 'tissue' has 2 subgroup, ANOVA will NOT be performed in this example. \begin{Schunk} \begin{Sinput} > eset = ABarray("RawExpressionABData.txt", "ExperimentDesign.csv", "tissue") \end{Sinput} \begin{Soutput} Using assayName to match experiment with signal in file: RawExpressionABData.txt The sample names are: [1] "A1" "A2" "A3" "B1" "B2" "B3" AssayNames in dataFile: RawExpressionABData.txt [1] "HB003U8_9/2/05_2:26_PM" "HB003U2_9/2/05_1:55_PM" "HB003TZ_9/2/05_3:10_PM" "HB003U3_9/2/05_2:11_PM" [5] "HB003TW_9/2/05_2:55_PM" "HB003TV_9/2/05_2:41_PM" Reading data from RawExpressionABData.txt ...... This may take several minutes.... Checking data format:::::::::::::::::::|| The results will be in the folder: Result_tissue/ [1] "Creating plot for Signal Hybridization_Control ..." [1] "Creating plot for Signal Negative_Control ..." [1] "Creating plot for Signal IVT_Kit_Control_BIOB ..." [1] "Creating plot for Signal IVT_Kit_Control_BIOC ..." [1] "Creating plot for Signal IVT_Kit_Control_BIOD ..." [1] "Creating plot for Signal RT_Kit_Control_DAP ..." [1] "Creating plot for Signal RT_Kit_Control_LYS ..." [1] "Creating plot for Signal RT_Kit_Control_PHE ..." Perform basic analysis for non-normalized data ... [1] "Creating barplot for probes detectable ... QC_DetectableProbeSN3" [1] "Creating boxplot for Raw_tissue ..." [1] "Creating correlation matrix and plot Raw_tissue ..." [1] "Creating detection concordance plot Raw_tissue ..." Perform Average imputation for probes with FLAG > 5000 [1] "Array A1 has 72 FLAGS > 5000" "Array A2 has 89 FLAGS > 5000" "Array A3 has 47 FLAGS > 5000" [4] "Array B1 has 91 FLAGS > 5000" "Array B2 has 96 FLAGS > 5000" "Array B3 has 75 FLAGS > 5000" The signals of each flagged probe were replaced with average signals from replicate arrays within the same subgroup (TissueA, TissueB). The imputed probes were written to file: Result_tissue/DataResult/ProbeImputed_tissue.csv Quantile normalized data was saved to: Result_tissue/DataResult/QuantileNormalizedExpression.csv The following probes still contain missing values even after imputation: [1] "128976" "215914" "226588" "249634" The expression data object was saved to R workspace file: ExperimentDesign_eset_28Mar2006.Rdata Perform data analysis on Quantile normalized data ... [1] "Creating boxplot for Quantile_tissue ..." [1] "Creating MA plot for Quantile_tissue" [1] "Creating MA Plot Quantile_tissue: TissueA ..." [1] "Creating QC_CVplot_Quantile_tissue_TissueA ..." [1] "Creating MA Plot Quantile_tissue: TissueB ..." [1] "Creating QC_CVplot_Quantile_tissue_TissueB ..." [1] "Creating scatter plot filtering S/N < 3 for Quantile_tissue ..." [1] "Creating correlation matrix and plot Quantile_tissue ..." [1] "Creating detection concordance plot Quantile_tissue ..." Performing t test between TissueA and TissueB ... 1 probes had sdev = 0, they were removed in the t test [1] "520680" [1] "Creating volcano plot ..." [1] "Writing t test result to file: Result_tissue/DataResult/ST_DetectPct50FCpvalFDR_TissueA-TissueB.csv" The t test was performed if the probe shows S/N >= 3 in 50% or more samples in either TissueA or TissueB The number of probes after filtering is: 22323 p=0.01, significant probes: 12611 [1] "Creating Fold Change plot for: TissueA-TissueB pVal 0.01 ..." [1] "Clustering was not performed, because there are 12611 probes." [1] " The limit of probe counts for clustering is: 800" p=0.05, significant probes: 16461 [1] "Creating Fold Change plot for: TissueA-TissueB pVal 0.05 ..." [1] "Clustering was not performed, because there are 16461 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.01, significant probes: 10666 [1] "Creating Fold Change plot for: TissueA-TissueB FDR 0.01 ..." [1] "Clustering was not performed, because there are 10666 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.05, significant probes: 15681 [1] "Creating Fold Change plot for: TissueA-TissueB FDR 0.05 ..." [1] "Clustering was not performed, because there are 15681 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.1, significant probes: 17416 [1] "Creating Fold Change plot for: TissueA-TissueB FDR 0.1 ..." [1] "Clustering was not performed, because there are 17416 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.25, significant probes: 19373 [1] "Creating Fold Change plot for: TissueA-TissueB FDR 0.25 ..." [1] "Clustering was not performed, because there are 19373 probes." [1] " The limit of probe counts for clustering is: 800" \end{Soutput} \end{Schunk} \section{Information about QC and Analysis Results} The function \Rfunction{ABarray} produced a lot of results. In addition, it returns an \Robject{ExpressionSet} object. This make it possible to use vast amount of packages and functions in Bioconductor (www.bioconductor.com) for further statistical analysis, e.g., classification, prediction. The results will be saved in a folder within the R working directory. The name of the folder begins with \textsf{Result\_} and the name of the group. In the example run, the fold name is \textsf{Result\_tissue}. There will be 3 directories created inside this folder to store the analysis results: One \textsf{DataResult} folder to store normalized signals, \textit{p} values and FDR from \textit{t} test, and ANOVA \textit{p} values if ANOVA is performed. One \textsf{jpgFigure} folder to store QC and analysis plots, and one \textsf{pdfFigure} folder. In \textsf{pdfFigure}, the plots are the same as in \textsf{jpgFigure}, but in higer resolution of pdf format. Now, let's take a look at the \Robject{eset}. \begin{Schunk} \begin{Sinput} > eset \end{Sinput} \begin{Soutput} Expression Set (exprSet) with 32878 genes 6 samples phenoData object with 3 variables and 6 cases varLabels sampleName: read from file assayName: read from file tissue: read from file \end{Soutput} \end{Schunk} We can also check the content of the experiment design by method in \Rclass{Bioconductor} package \Rpackage{Biobase}, \Rfunction{pData}. \begin{Schunk} \begin{Sinput} > pData(eset) \end{Sinput} \begin{Soutput} sampleName assayName tissue 1 A1 HB003U2_9/2/05_1:55_PM TissueA 2 A2 HB003U3_9/2/05_2:11_PM TissueA 3 A3 HB003U8_9/2/05_2:26_PM TissueA 4 B1 HB003TV_9/2/05_2:41_PM TissueB 5 B2 HB003TW_9/2/05_2:55_PM TissueB 6 B3 HB003TZ_9/2/05_3:10_PM TissueB \end{Soutput} \end{Schunk} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %- Function doPlotEset \section{Data Analysis on eSet object: doPlotEset} In this example, we will show the utility of \Rfunction{doPlotEset}. This function perform a number of data quality assessment and some statistical analysis (see \Rfunction{doPlotFCT} for fold change and t test, \Rfunction{doANOVA} for one- or two-way ANOVA analysis). \subsection{Data Quality Plots} \begin{itemize} \item Boxplot showing distribution of signal range. \item MA plot showing data distribution and signal variability between hybridizations. There might be several MA plots produced. All pairs of hybridizations within the group will be plotted if number of pairs is less than 35 pairs. In addition, MA plots will be produced for each pairs within subgroup of the group to see signal variability between each individual members of the subgroup. \item CV plot showing amount of variation within replicate hybridizations of each subgroup. \item Scatterplot showing visually the correlation between any two hybridizations within the group. Correlation coefficient is also calculated for each pair. \item Correlation heatmap showing correlation between any pairs of hybridization. The heatmap is color coded to quickly visualize the extent of the correlation. \item S/N detection concordance. The signal is detected if the probe shows S/N ratio equal to or above a user defined threshold (default is 3). The S/N detection heatmap shows detection concordance between any pairs of hybridization. \end{itemize} \subsection{Statistical Analysis} The function \Rfunction{doPlotEset} accepts a parameter \Rfunarg{test} either \Rclass{TRUE} (the default) or \Rclass{FALSE}. If \Rfunarg{test} = \Rclass{TRUE} (the default), regular t test will be performed. For details, please refer to \Rfunction{doPlotFCT} in \Rpackage{ABarray} package. If there are more than two subgroups within the group, pairwise t test will be performed on each pairs of subgroup. In addition, one way ANOVA test will also be performed. The t test results are written to files and several graphics plots will be produced. ANOVA test results are written to file as well if it is performed. \subsection{Requred Data and Function Arguments} \begin{itemize} \item An \Robject{ExpressionSet} object. The S/N ratio information is stored in \Robject{assayDataElement(eset, "snDetect")}. This will automatically be created from \Rfunction{ABarray} in \Rpackage{ABarray} package. The analysis will filter probes based on S/N >= 3 in 75\% of samples. If S/N is not available, all probes will be used for analysis (no filtering will be applied). \item An \Rfunarg{group} parameter. The group is a factor that t test should be performed on. The name of the group should correspond to one of the names in experiment design file that define sample type. If there are 3 or more levels for the group, pairwise t test and fold change will be peformed for all possible pairs. \item An optional \Rfunarg{name} parameter. The name will be used for output filenames to distinguish different types of runs. For example, \Rmethod{quantile} to indicate the results were produced after \Rmethod{quantile} process. \Rmethod{raw} to indicate the results were produced with raw data. If \Rfunarg{name} is missing, the analysis and output files will still be produced. \item An optional \Rfunarg{snThresh} parameter. If this is ommitted, default snThresh = 3 will be used. To use a different threshold, use \Rfunarg{snThresh} = \Rfunarg{newValue}. \item Parameter \Rfunarg{test}. By default, \Rfunarg{test} = \Rclass{TRUE}. This indicates to perform statistical test (\textit{t} test and ANOVA if applicable). To avoid these test, use \Rfunarg{test} = \Rclass{FALSE}. For more details about the \textit{t} test and fold change, see \Rfunction{doPlotFCT} function. For more details for ANOVA analysis, see \Rfunction{doANOVA}. \end{itemize} \subsection{doPlotEset Function Demonstration} First let's see what the \Robject{eset} contains \begin{Schunk} \begin{Sinput} > eset \end{Sinput} \begin{Soutput} Expression Set (exprSet) with 32878 genes 6 samples phenoData object with 3 variables and 6 cases varLabels sampleName: read from file assayName: read from file tissue: read from file \end{Soutput} \end{Schunk} Let's check the experiment design and pick a group for testing \begin{Schunk} \begin{Sinput} > pData(eset) \end{Sinput} \begin{Soutput} sampleName assayName tissue 1 A1 HB003U2_9/2/05_1:55_PM TissueA 2 A2 HB003U3_9/2/05_2:11_PM TissueA 3 A3 HB003U8_9/2/05_2:26_PM TissueA 4 C1 HB003UA_9/2/05_3:24_PM TissueC 5 C2 HB003UB_9/2/05_3:44_PM TissueC 6 C3 HB003UD_9/2/05_3:58_PM TissueC \end{Soutput} \end{Schunk} Let's begin to perform the analysis \begin{Schunk} \begin{Sinput} > doPlotEset(eset, "tissue", name = "quantile") \end{Sinput} \begin{Soutput} [1] "Creating boxplot for quantile_tissue ..." [1] "Creating MA plot for quantile_tissue" [1] "Creating MA Plot quantile_tissue: TissueA ..." [1] "Creating CVplot_quantile_tissue_TissueA ..." [1] "Creating MA Plot quantile_tissue: TissueC ..." [1] "Creating CVplot_quantile_tissue_TissueC ..." [1] "Creating scatter plot for quantile_tissue ..." [1] "Creating scatter plot filtering S/N < 3 for quantile_tissue ..." [1] "Creating correlation matrix and plot quantile_tissue ..." [1] "Creating detection concordance plot quantile_tissue ..." Performing t test between TissueA and TissueC ... 2 probes had sdev = 0, they were removed in the t test [1] "207943" "520680" [1] "Creating volcano plot ..." [1] "Writing t test result to file: Result_tissue/DataResult/DetectPct50FCpvalFDR_TissueA-TissueC.csv" The t test was performed if the probe shows S/N >= 3 in 50% or more samples in either TissueA or TissueC The number of probes after filtering is: 21628 p=0.01, significant probes: 3470 [1] "Creating Fold Change plot for: TissueA-TissueC pVal 0.01 ..." [1] "Clustering was not performed, because there are 3470 probes." [1] " The limit of probe counts for clustering is: 800" p=0.05, significant probes: 7595 [1] "Creating Fold Change plot for: TissueA-TissueC pVal 0.05 ..." [1] "Clustering was not performed, because there are 7595 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.01, significant probes: 429 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.01 ..." [1] "Creating cluster heatmap for FDR 0.01 using Correlation ..." [1] "Creating cluster heatmap for FDR 0.01 using Euclidean ..." FDR=0.05, significant probes: 2700 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.05 ..." [1] "Clustering was not performed, because there are 2700 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.1, significant probes: 5577 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.1 ..." [1] "Clustering was not performed, because there are 5577 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.25, significant probes: 10894 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.25 ..." [1] "Clustering was not performed, because there are 10894 probes." [1] " The limit of probe counts for clustering is: 800" \end{Soutput} \end{Schunk} %- End funciton doPlotEset %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %- Function doPlotFCT \section{Fold Change and t test: doPlotFCT} In this example, we will show the utility of \Rfunction{doPlotFCT}. This function perform calculation for fold change and \textit{t} test. The \textit{t} test \textit{p} values are adjusted using Benjamini-Hochberg FDR at preset levels: 0.01, 0.05, 0.1, 0.25, and raw \textit{p} values 0.01 (no FDR) and 0.05 (no FDR). The results are written to a cvs file, whose name begin with DetectPctFCpvalFDR and the group name that the test was performed on. In addition, fold change are plotted into graphics for various FDR levels, and volcano plot of fold change and \textit{t} test \textit{p} values are plotted. The analysis uses S/N ratio for filtering. By default, if a probe is detected in 50\% or more samples within any subgroup, it is included in the analysis. These thresholds can be changed at run time with optional parameters. If there are less than 800 probes for a particular threshold level, clustering heatmap may also be produced. There will be 2 clustering figures, one uses correlation coefficient for clustering, and the other one uses distance for clustering. The \textit{t} test \textit{p} values will be calculated using R implementation of \textit{t} test assuming unequal variances and using log2 transformed normalized data. The fold changes will also be calculated. If a paired \textit{t} test is performed, the fold changes will be average of paired fold changes. If not paired \textit{t} test, the fold changes will be fold changes of averaged expression values. \subsection{Required Data and Function Arguments} \begin{itemize} \item An \Robject{ExpressionSet} object. The S/N ratio information is stored in \Robject{assayDataElement(eset, "snDetect")}. This will automatically be created from \Rfunction{ABarray} in \Rpackage{ABarray} package. The analysis will filter probes based on S/N and percentage of samples detectable within subgroups. If S/N is not available, all probes will be used for analysis (no filtering will be applied). \item An \Rfunarg{group} parameter. The group is a factor that \textit{t} test should be performed on. The name of the group should correspond to one of the names in experiment design file that define sample type. If there are 3 or more levels for the group, pairwise \textit{t} test and fold change will be peformed for all possible pairs. \item An optional \Rfunarg{grpMember} parameter. If this is missing, all members in the group will be used for analysis. If it is defined and passed to the \Rfunction{doPlotFCT} analysis will be performed only on those members within the group. \item{order1}{ optional, For a pairwise comparison the ordering of the first group of replicates} \item{order2}{ optional, For a pairwise comparison the ordering of the second group of replicates} \item{snThresh}{ optional S/N ratio threshold. Default = 3 } \item{detectSample}{ Percentage of samples the probe is detected in order to consider in \textit{t} test. Default = 0.5} \end{itemize} \subsection{doPlotFCT Function Demonstration} First let's see what the \Robject{eset} contains \begin{Schunk} \begin{Sinput} > eset \end{Sinput} \begin{Soutput} Expression Set (exprSet) with 32878 genes 6 samples phenoData object with 3 variables and 6 cases varLabels sampleName: read from file assayName: read from file tissue: read from file \end{Soutput} \end{Schunk} Let's check the experiment design and pick a group for testing \begin{Schunk} \begin{Sinput} > pData(eset) \end{Sinput} \begin{Soutput} sampleName assayName tissue 1 A1 HB003U2_9/2/05_1:55_PM TissueA 2 A2 HB003U3_9/2/05_2:11_PM TissueA 3 A3 HB003U8_9/2/05_2:26_PM TissueA 4 C1 HB003UA_9/2/05_3:24_PM TissueC 5 C2 HB003UB_9/2/05_3:44_PM TissueC 6 C3 HB003UD_9/2/05_3:58_PM TissueC \end{Soutput} \end{Schunk} Let's begin to perform the analysis \begin{Schunk} \begin{Sinput} > doPlotFCT(eset, "tissue") \end{Sinput} \begin{Soutput} Performing t test between TissueA and TissueC ... 2 probes had sdev = 0, they were removed in the t test [1] "207943" "520680" [1] "Creating volcano plot ..." [1] "Writing t test result to file: Result_tissue/DataResult/DetectPct50FCpvalFDR_TissueA-TissueC.csv" The t test was performed if the probe shows S/N >= 3 in 50% or more samples in either TissueA or TissueC The number of probes after filtering is: 21628 p=0.01, significant probes: 3470 [1] "Creating Fold Change plot for: TissueA-TissueC pVal 0.01 ..." [1] "Clustering was not performed, because there are 3470 probes." [1] " The limit of probe counts for clustering is: 800" p=0.05, significant probes: 7595 [1] "Creating Fold Change plot for: TissueA-TissueC pVal 0.05 ..." [1] "Clustering was not performed, because there are 7595 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.01, significant probes: 429 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.01 ..." [1] "Creating cluster heatmap for FDR 0.01 using Correlation ..." [1] "Creating cluster heatmap for FDR 0.01 using Euclidean ..." FDR=0.05, significant probes: 2700 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.05 ..." [1] "Clustering was not performed, because there are 2700 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.1, significant probes: 5577 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.1 ..." [1] "Clustering was not performed, because there are 5577 probes." [1] " The limit of probe counts for clustering is: 800" FDR=0.25, significant probes: 10894 [1] "Creating Fold Change plot for: TissueA-TissueC FDR 0.25 ..." [1] "Clustering was not performed, because there are 10894 probes." [1] " The limit of probe counts for clustering is: 800" \end{Soutput} \end{Schunk} %- End function doPlotFCT %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %- Funciton doANOVA \section{ANOVA analysis} The function \Rfunction{doANOVA} in package \Rpackage{ABarray} is designed to perform one way or two way ANOVA analysis. If only one factor is provided in parameter, one way ANOVA is performed. If two factors are provided, two way ANOVA is performed. The S/N ratio will be used to perform filtering. If a probe is detected (S/N >= 3, default value) in 50\% (default value) or more samples within any subgroups, it is included in the ANOVA analysis. If a probe is not detected in 50\% or more samples in all the subgroups, it is removed from ANOVA analysis. \subsection{Required Data Object} The function \Rfunction{doANOVA} works on either an \Rclass{ExpressionSet} object or expression \Robject{matrix}. If \Rclass{ExpressionSet} object is used, experiment design information should be already in the \Rclass{ExpressionSet} object. If expression \Robject{matrix} is used, rows of the matrix should be probes or genes, and columns of the matrix should be hybridization (or arrays). \begin{itemize} \item \Robject{ExpressionSet} object. The object can be generated from function \Rfunction{ABarray} in package \Rpackage{ABarray}. For additional information about functionality of \Rfunction{ABarray}, refer to document from \Rfunction{ABarray}. The object can also be generated from other packages in \Rpackage{Bioconductor} project. See www.bioconductor.org for more information about other packages. \item factor1. Factor name that the ANOVA test will be performed on. If an \Robject{ExpressionSet} object is used, either name or labels can be used. The name should be present in experiment design informaton in the \Robject{ExpressionSet} object. Function \Rfunction{ABarray} in package \Rpackage{ABarray} automatically creates such information if appropriate experiment design file is used when function \Rfunction{ABarray} was run. If expression matrix is to be used, \Robject{lables} should be used. A \Robject{label} is a vector which labels each hybridization (or array) into groups. For example, a \Robject{label} <- c('drug1', 'drug1', 'drug2', 'drug2', 'none', 'none') indicates samples for arrays 1 and 2 were treated with drug1, and samples for arrays 5 and 6 were not treated. \item optional factor2. If the second factor is provided, two way ANOVA will be performed. \end{itemize} \subsection{doANOVA Analysis Output} If one way ANOVA is performed, a vector with ANOVA p values will be returned. The names of the vector are probeIDs. If two way ANOVA is performed, a matrix will be returned. The rows of the matrix are probeIDs and the columns are the levels of factors and interactions of the levels. The ANOVA results will be saved into file automatically. \subsection{doANOVA Function Demonstration} \subsubsection{Required Data} First let's see what the \Robject{eset} contains \begin{Schunk} \begin{Sinput} > eset \end{Sinput} \begin{Soutput} Expression Set (exprSet) with 33096 genes 12 samples phenoData object with 6 variables and 12 cases varLabels sampleName: read from file assayName: read from file arrayName: read from file pool: read from file IVT: read from file multIVT: read from file \end{Soutput} \end{Schunk} Let's check the experiment design and pick factors for testing \begin{Schunk} \begin{Sinput} > pData(eset) \end{Sinput} \begin{Soutput} sampleName assayName arrayName pool IVT multIVT 1 EL_1A_1 HA004J2_1A_1 HA004J2 pool1 IVT1 IVT1 2 EL_1A_2 HA004J7_1A_2 HA004J7 pool1 IVT1 IVT1 3 EL_1A_3 HA004JB_1A_3 HA004JB pool1 IVT1 IVT1 4 EL_1B_1 HA004JD_1B_1 HA004JD pool1 IVT2 IVT2 5 EL_1B_2 HA004JI_1B_2 HA004JI pool1 IVT2 IVT2 6 EL_1B_3 HA004JK_1B_3 HA004JK pool1 IVT2 IVT2 7 EL_2A_1 HA004JN_2A_1 HA004JN pool2 IVT1 IVT3 8 EL_2A_2 HA004JO_2A_2 HA004JO pool2 IVT1 IVT3 9 EL_2A_3 HA004JP_2A_3 HA004JP pool2 IVT1 IVT3 10 EL_2B_1 HA004JS_2B_1 HA004JS pool2 IVT2 IVT4 11 EL_2B_2 HA004JT_2B_2 HA004JT pool2 IVT2 IVT4 12 EL_2B_3 HA004JU_2B_3 HA004JU pool2 IVT2 IVT4 \end{Soutput} \end{Schunk} \subsubsection{One Way ANOVA} Let's begin to perform the analysis. Here we take a subset of probes (the first 20 probes) to perform the ANOVA. We chose \Robject{multIVT}, since it has 4 levels: IVT1, IVT2, IVT3, IVT4. \begin{Schunk} \begin{Sinput} > aResult <- doANOVA(eset[1:20, ], "multIVT") \end{Sinput} \begin{Soutput} Performing one-way ANOVA for multIVT ... Probes with S/N < 3 in at least 50% of samples in all subgroups of multIVT were not considered. [1] "One-way ANOVA results (9 probes) were written to file: Result_multIVT/DataResult/ANOVA_oneway_multIVT.csv" \end{Soutput} \end{Schunk} The results of this ANOVA analysis were saved into a file called \Robject{ANOVA\_oneway\_multIVT.txt}. Since we chose to perform one way ANOVA, \Robject{aResult} should be a vector of length 20 (we only used 20 probes). \begin{Schunk} \begin{Sinput} > aResult \end{Sinput} \begin{Soutput} 100002 100037 100039 100058 1.620190e-03 5.705200e-03 1.009991e-03 9.374965e-05 100060 100079 100089 100027 1.016003e-01 5.021948e-06 1.462993e-04 1.652952e-01 100052 5.952003e-01 \end{Soutput} \end{Schunk} To save the results into file, we use R \Rfunction{write.table} function. It takes an object whos data need to be written to file. Here our object is \Robject{aResult}. A file argument (i.e., the name of the file). In the following command, \Rfunarg{sep = ``,''} tells it to save the file as comma delimit format. Argument \Rfunarg{row.names = TRUE} tells it to write probeID into file as well. \begin{Schunk} \begin{Sinput} > write.table(aResult, file = "multIVT_oneWay_ANOVA.csv", + sep = ",", row.names = TRUE) \end{Sinput} \end{Schunk} You can check the content of the file. The first column is probeID, the second column is p values from one way ANOVA analysis. \subsubsection{Two Way ANOVA} Let's perform two way ANOVA. Here we pass 2 factors ('pool' and 'IVT'), and as before we will use only 20 probes for demonstration purpose. \begin{Schunk} \begin{Sinput} > bResult <- doANOVA(eset[1:20, ], "pool", "IVT") \end{Sinput} \begin{Soutput} Performing two-way ANOVA for pool and IVT ... Probes with S/N < 3 in at least 50% of samples in all subgroups of pool and IVT were not considered. [1] "Two-way ANOVA results (8 probes) were written to file: Result_pool_IVT/ANOVA_pool_IVT.csv" \end{Soutput} \end{Schunk} The results of the ANOVA analysis were saved into file \Robject{ANOVA\_pool\_IVT.txt}. Let's check if the \Robject{bResult} is a matrix \begin{Schunk} \begin{Sinput} > dim(bResult) \end{Sinput} \begin{Soutput} [1] 8 3 \end{Soutput} \end{Schunk} Let's get the p values for the two way ANOVA \begin{Schunk} \begin{Sinput} > bResult \end{Sinput} \begin{Soutput} p(pool) p(IVT) p(pool*IVT) 100002 4.215936e-03 0.002562953 3.098940e-02 100037 1.116917e-02 0.156183104 5.328180e-03 100039 5.423037e-04 0.173732795 5.336480e-03 100058 6.514385e-04 0.009500397 8.888504e-05 100060 7.294050e-01 0.051155428 1.058891e-01 100079 3.244984e-05 0.001540973 5.688470e-06 100089 6.787365e-04 0.961322381 8.243031e-05 100052 7.863245e-01 0.206235388 8.623153e-01 \end{Soutput} \end{Schunk} To save the result to file, we can use the following command \begin{Schunk} \begin{Sinput} > write.table(bResult, file = "pool_IVT_twoway_ANOVA.csv", + sep = ",", row.names = TRUE, col.names = NA) \end{Sinput} \end{Schunk} %- End function doANOVA %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %- Function doLPE \section{LPE analysis} The function \Rfunction{doLPE} in package \Rpackage{ABarray} is based on \Rpackage{LPE} package from www.bioconductor.org. The \Rpackage{LPE} package describes local-pooled-error (LPE) test for identifying differentially expressed genes especially for low number of replicates (2-3) (See Jain:2003). The test statistics is calculated as follows, which calculates difference in medians between the two experimental conditions: \begin{huge} $\textit{Z} = \frac{Med_{1} - Med_{2}}{\sigma_{pooled}}$ \end{huge} where the variance is estimated as follows: \begin{huge} $\sigma^{2}_{pooled} = \frac{\pi}{2}[\frac{\sigma^{2}_{1}(Med_{1})}{n1} + \frac{\sigma^{2}_{2}(Med_{2})}{n2}]$ \end{huge} The function \Rfunction{doLPE} will automatically adjust p values using Benjamini-Hochberg FDR approach. Jain et. al. (2003) Local pooled error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics, 19, 1945-1951 \subsection{Required Data Object} The function \Rfunction{doANOVA} works on an \Rclass{ExpressionSet} object. \begin{itemize} \item \Robject{ExpressionSet} object. The object can be generated from function \Rfunction{ABarray} in package \Rpackage{ABarray}. For additional information about functionality of \Rfunction{ABarray}, refer to document from \Rfunction{ABarray}. The object can also be generated from other packages in \Rpackage{Bioconductor} project. See www.bioconductor.org for more information about other packages. The S/N ratio information is stored in \Robject{assayDataElement(eset, "snDetect")}. This will automatically be created from \Rfunction{ABarray} in \Rpackage{ABarray} package. The analysis will filter probes based on S/N >= 3 in 75\% of samples. If S/N is not available, all probes will be used for analysis (no filtering will be applied). \item \Rfunarg{group} parameter. The group is a factor that the test should be performed on. The name of the group should correspond to one of the names in experiment design file that define sample type. If there are 3 or more levels for the group, the members of the group should be specified for LPE test. \item \Rfunarg{member} parameter. Where there is more than 2 levels in the \Rfunarg{group} factor, any two members that need to be compared should be specified in this paramter. For example, \Rfunarg{member <- c('drug1', 'drug4')} will indicate to perform LPE between \Rfunarg{drug1} and \Rfunarg{drug4} only. \end{itemize} \subsection{doLPE Analysis Output} The function \Rfunction{doLPE} returns a dataframe with p values and FDR adjusted p values for each probes (if S/N available, filtered probes). It also produces two files for preset FDR = 0.01 and FDR = 0.05. \subsection{doLPE Function Demonstration} \subsubsection{Required Data} First let's see what the \Robject{eset} contains \begin{Schunk} \begin{Sinput} > eset \end{Sinput} \begin{Soutput} Expression Set (exprSet) with 32878 genes 6 samples phenoData object with 3 variables and 6 cases varLabels sampleName: read from file assayName: read from file tissue: read from file \end{Soutput} \end{Schunk} Let's check the experiment design and pick factors for testing \begin{Schunk} \begin{Sinput} > pData(eset) \end{Sinput} \begin{Soutput} sampleName assayName tissue 1 A1 HB003U2_9/2/05_1:55_PM TissueA 2 A2 HB003U3_9/2/05_2:11_PM TissueA 3 A3 HB003U8_9/2/05_2:26_PM TissueA 4 B1 HB003TV_9/2/05_2:41_PM TissueB 5 B2 HB003TW_9/2/05_2:55_PM TissueB 6 B3 HB003TZ_9/2/05_3:10_PM TissueB \end{Soutput} \end{Schunk} \subsubsection{LPE analysis} Let's begin to perform the analysis. \begin{Schunk} \begin{Sinput} > aLPE.result <- doLPE(eset, "tissue") \end{Sinput} \begin{Soutput} Performing LPE analysis for tissue ... [1] "LPE results were written to file Result_tissue/DataResult/LPE_tissue_TissueB_TissueA.csv" \end{Soutput} \end{Schunk} Let's take a few entries in aLPE.result to see what the dataframe looks like. \begin{Schunk} \begin{Sinput} > aLPE.result[1:4, ] \end{Sinput} \begin{Soutput} x.x.B1 x.x.B2 x.x.B3 median.1 std.dev.1 y.y.A1 y.y.A2 y.y.A3 median.2 std.dev.2 median.diff 520680 21.78399 21.78399 21.78399 21.78399 0.07155299 21.78399 21.78399 21.78399 21.78399 0.11310696 0.0000000 643139 21.62803 21.62801 21.62809 21.62803 0.07155299 21.62810 21.50560 21.43475 21.50560 0.11310696 0.1224310 701229 21.50559 21.50557 21.50566 21.50559 0.07155299 21.43456 21.62804 21.62819 21.62804 0.11310696 -0.1224466 128174 21.43441 21.43437 21.43454 21.43441 0.07155299 18.45148 18.63286 18.47317 18.47317 0.05467501 2.9612377 pooled.std.dev abs.z.stats p.adj.adjp.rawp p.adj.adjp.BH p.adj.index z.real 520680 0.09682188 0.000000 1.0000000 1.0000000 17602 0.000000 643139 0.09682188 1.264497 0.2060517 0.2422529 18865 1.264497 701229 0.09682188 1.264658 0.2059938 0.2421976 1972 1.264658 128174 0.06514452 45.456439 0.0000000 0.0000000 8984 45.456439 \end{Soutput} \end{Schunk} To selectively view the result \begin{Schunk} \begin{Sinput} > aLPE.result[1:5, c("median.1", "median.2", "median.diff", "p.adj.adjp.rawp", "p.adj.adjp.BH")] \end{Sinput} \begin{Soutput} median.1 median.2 median.diff p.adj.adjp.rawp p.adj.adjp.BH 520680 21.78399 21.78399 0.0000000 1.0000000 1.0000000 643139 21.62803 21.50560 0.1224310 0.2060517 0.2422529 701229 21.50559 21.62804 -0.1224466 0.2059938 0.2421976 128174 21.43441 18.47317 2.9612377 0.0000000 0.0000000 703444 21.34799 19.30107 2.0469234 0.0000000 0.0000000 \end{Soutput} \end{Schunk} %- End function doLPE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %- Function doVennDiagram \section{Creating Venn Diagram: doVennDiagram} Venn Diagrams are frequently used for examining similarities and differences in gene lists generated from different analysis. It is made up of two or more overlapping circles. The functionality of the \Rfunction{doVennDiagram} in \Rpackage{ABarray} relies on some functions provided in \Rpackage{limma} package. \subsection{Required Data} \Rfunction{doVennDiagram} accepts two or three vectors of gene lists. If two gene lists are provided, two way Venn diagram is produced. If three gene lists are provided, three way Venn diagram is produced. \begin{itemize} \item list1. The first vector of gene list. \item list2. The second vector of gene list. \item list3. Optional third vector of gene list. \item names. The names for use for each list in diagram. \end{itemize} \subsection{doVennDiagram Function Demonstration} We will demonstrate two-way and three-way Venn diagram by creating three gene lists. Note: the result of joined lists are automatically saved into file. \begin{itemize} \item list1 contains probe names from index 1 to 100 \item list2 contains probe names from index 61 to 160 \item list3 contains probe names from index 81 to 200 \end{itemize} After these three gene lists are created, we perform \Rfunction{doVennDiagram}. \begin{Schunk} \begin{Sinput} > list1 <- geneNames(eset)[1:100] > list2 <- geneNames(eset)[61:160] > list3 <- geneNames(eset)[81:200] > length(list1) \end{Sinput} \begin{Soutput} [1] 100 \end{Soutput} \begin{Sinput} > length(list2) \end{Sinput} \begin{Soutput} [1] 100 \end{Soutput} \begin{Sinput} > length(list3) \end{Sinput} \begin{Soutput} [1] 120 \end{Soutput} \end{Schunk} \subsubsection{Two Way Venn Diagram} There are 40 probes common between \Robject{list1} and \Robject{list2}. Let's create Venn diagram between \Robject{list1} and \Robject{list2}. \begin{Schunk} \begin{Sinput} > doVennDiagram(list1, list2) \end{Sinput} \begin{Soutput} [1] "List information is written to file Venn_list_1+2.csv" \end{Soutput} \end{Schunk} \includegraphics[width=0.5\textwidth]{Figure/venn-twoWay} \subsubsection{Three way Venn Diagram} There are 40 probes common between \Robject{list1} and \Robject{list2}, and 20 probes between \Robject{list1} and \Robject{list3}, and 80 probes between \Robject{list2} and \Robject{list3}. Let's create Venn diagram between \Robject{list1} and \Robject{list2}. \begin{Schunk} \begin{Sinput} > doVennDiagram(list1, list2, list3) \end{Sinput} \begin{Soutput} [1] "List information is written to file Venn_list_1+2+3.csv" \end{Soutput} \end{Schunk} \includegraphics[width=0.5\textwidth]{Figure/venn-threeWay} %- End function doVennDiagram \end{document}