%\VignetteIndexEntry{1. Primer} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[11pt]{article} \usepackage{amsmath} \usepackage{amssymb} \usepackage{epsfig} \usepackage{lscape} \usepackage{graphics} \usepackage{Sweave} \textwidth=5.4in \textheight=8.8in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.5in \setcounter{tocdepth}{2} \setcounter{secnumdepth}{3} \begin{document} \title{Affymetrix Quality Assessment and Analysis Tool} \author{Xiwei Wu and Xuejun Arthur Li} \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} Affymetrix GeneChip is a commonly used tool to study gene expression profiles. The purpose of this package is to provide a comprehensive and easy-to-use tool for quality assessment and to identify differentially expressed genes in the Affymetrix gene expression data. Initial data quality assssment is achieved by a series of QC plots in an HTML report for easy visualization. More importantly, functions are provided for biologists who have little statistical background to generate design and contrast matrices for simple, as well as complicated, designed experiments, such as one factor with multiple levels, multiple factors with interactions, or one or more factors with covariates. Users can select either an ordinary linear regression model, LIMMA~\cite{limma}, or permutation test for differentially expressed gene identification. Differentially expressed genes are reported in tabular format with annotations hyperlinked to online biological databases. Wrapper functions are also designed to make analysis even more simplified. This guide will use an example dataset to demonstrate how to perform analysis of experiments with commonly used designs by using this package. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Data} We will use the dataset \texttt{estrogen} (8 Affymetrix genechips) downloaded from the Bioconductor that includes 12,625 genes to demonstrate how to use this vignette. This dataset has also been used as example data in the \verb"factDesign" and \verb"LIMMA" vignette. The investigators are interested in the effect of estrogen on the genes in ER+ breast cancer cells and how they differ across different time periods. In this example, there are two time periods, 10 hours and 48 hours. <<>>= library(AffyExpress) library(estrogen) datadir <- system.file("extdata", package = "estrogen") phenoD<-read.AnnotatedDataFrame("phenoData.txt", path=datadir, sep="", header=TRUE, row.names="filename") raw<-ReadAffy(filenames=rownames(pData(phenoD)), phenoData= phenoD, celfile.path=datadir) pData(raw) @ It is very important to have the correct phenotype before doing further analysis. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Quality Assessment} To run the quality assessment, run the following function <<>>= AffyQA (parameters=c("estrogen", "time.h"), raw=raw) @ The \verb"AffyQA" function will create a quality assessment report in AffyQA.html that contains a set of assessment plots, including Affymetrix recommended quality assessment, RNA quality assessment, sample quality assessment, quality diagnostic using affyPLM (pseudo-chip images and NUSE and RLE plots) in your current working directory. @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preprocessing and Filtering} We can use the \verb"pre.process" function to convert the \verb"AffyBatch" data set into an \verb"ExpressionSet" using either \verb"RMA" or \verb"GCRMA" methods. Suppose that we are using the \verb"RMA" method. <<>>= normaldata<-pre.process(method="rma", raw=raw) dims(normaldata) @ The next step, we will filter the normalized data by using the \verb"Filter" function. Suppose that we would like to filter the data based on at least 2 of the chips whose expression value is greater than 6. <<>>= filtered<-Filter(object=normaldata, numChip=2, bg=6) @ Now, we have 9038 genes left. The examples below will be based on these 9038 genes. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Identifying Differentially Expressed Genes} Identifying differentially expressed genes depends on a researcher's interest and applying correct statistical models during the analysis process. We will illustrate a few basic statistical models on this data set. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Single Factor} Suppose we would like to identify how differentially expressed genes respond to estrogen regardless of time period. Analysis on a single categorical variable is the same as One Way ANOVA. Since we only have two levels, \verb"present" and \verb"absent", for the \verb"estrogen" variable, this type of analysis is also equivalent to a two-sample t test. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Design Matrix and Contrast Matrix} To run the analysis, we need to create a design and a contrast matrix. One of the major strengths of this package that we can use the built-in function to create the design matrix and the contrast matrix using standard statistics approaches which is different from the design matrix from the \verb"LIMMA" package. To create a design matrix, we will use the \verb"make.design" function. <<>>= design<-make.design(target=pData(filtered), cov="estrogen") design @ Notice that the name of the second column of the design matrix is \texttt{estrogen/present}, where \texttt{estrogen} is the name of the variable and \texttt{present} tells us that \texttt{present} corresponds to 1. Thus, the design matrix above is equvalent to the equation below: \begin{equation} \\y=\alpha+\beta_{E}x_{E}+\epsilon \end{equation} where \[ x_{E}=\left\{ \begin{array}{ll} 1 & \mbox{if estrogen = "present"}\\ 0 & \mbox{if estrogen = "absent"} \end{array} \right. \] Next we need to create a contrast matrix. Since we are comparing \verb"present" versus \verb"absent", we will create the following contrast: <<>>= contrast<-make.contrast(design.matrix=design, compare1="present", compare2="absent") contrast @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Analysis} Once the design matrix and contrast matrix are created, we can run the analysis by using the \verb"regress" function. There are three types of regression methods that are being supported: \verb"LIMMA" (computing moderated t-statistics and log-odds of differential expressions by empirical Bayes shrinkage of the standard errors towards a common value), permutation test (resampling the phenotype), and ordinary linear regression. Also, we can apply multiple comparison corrections by using the \verb"adj" option, such as \verb"fdr". The default value for the \verb"adj" is \verb"none" <<>>= result<-regress(object=filtered, design=design, contrast=contrast, method="L", adj="fdr") @ Here are the first three genes of the result <<>>= result[1:3,] @ For the next step, we can select differentaly expressed genes based on p value and/or fold change. Suppose that we would like to select genes with a p value <0.05 and a fold change value greater than 1.5. <<>>= select<-select.sig.gene(top.table=result, p.value=0.05, m.value=log2(1.5)) @ The \verb"select.sig.gene" function adds an additional column, \verb"significant", which gives a value of either TRUE or FALSE indicating whether the gene is differentially expressed based on your selection criteria. In this example, there are 381 differentially expressed genes being selected. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Output Your Result} To output the differentially expressed genes along with annotations to an HTML file in your current working directory, we can use the \verb"result2html" function. <<>>= result2html(cdf.name=annotation(filtered), result=select, filename="singleFactor") @ %The HTML file would look like the one below: %\newpage %\rotatebox{90}{\resizebox{9in}{!}{\includegraphics[0in,0in][15in,8in]{singleFactor.jpg}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{A Wrapper Function} There is a wrapper function, \texttt{AffyRegress}, that can acomplish all of the above steps together including: create a design and contrast matrix, run regression, select differentaly expressed genes, and output the differentally expressed gene to an html file. <<>>= result.wrapper<-AffyRegress(normal.data=filtered, cov="estrogen", compare1="present", compare2="absent", method="L", adj="fdr", p.value=0.05, m.value=log2(1.5)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Single Factor Adjusting for Covariates} We can analyze the single factor effect (\verb"estrogen" in our example) and adjust for other covariates (\verb"time.h"). This is not the researcher's interest. However, we will present this example only for illustration purposes. This statistical approach is the same as Randomized block design which is used to iosolate the variation due to the nuisance variable. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Design Matrix and Contrast Matrix} We can create the following design and contrast matrix. <<>>= design.rb<-make.design(target=pData(filtered), cov=c("estrogen", "time.h")) design.rb @ The design matrix above is equvalent to the equation below: \begin{equation} \\y=\alpha+\beta_{E}x_{E}+\beta_{T}x_{T}+\epsilon \end{equation} where \[ x_{T}=\left\{ \begin{array}{ll} 1 & \mbox{if time = 48}\\ 0 & \mbox{if time = 10} \end{array} \right. \] <<>>= contrast.rb<-make.contrast(design.matrix=design.rb, compare1="present", compare2="absent") contrast.rb @ Once we obtained the design and contrast matrix, we can use similar steps documented in section 5.1.2 to select differentially expressed genes. We can also use the wrapper function \verb"AffyRegress" to complete all the steps at once. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Multifactor Analysis I} One possible interest is to examine the estrogen effect at either the 10 hour period or 48 hour period. One way of conducting the analysis is to subset the data set into two groups, with one containing data at the 10 hour period and the other containing data at the 48 hour period. Then we can use single-factor analysis for each group. Instead of spliting the data into two groups, we can use a more complex model like the one below: \begin{equation} \\y=\alpha+\beta_{E}x_{E}+\beta_{T}x_{T}+\beta_{E:T}x_{E}x_{T}+\epsilon \end{equation} The interaction term between estrogen and time allows us to analyze the estrogen effect at different time periods. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Design Matrix and Contrast Matrix} We will first create the following design matrix, which is equivalent to the model above. <<>>= design.int<-make.design(pData(filtered), cov=c("estrogen", "time.h"), int=c(1,2)) design.int @ Suppose we are interested in the estrogen effect at the 10 hour period. We will create the follwoing contrast: <<>>= contrast.10<-make.contrast(design.matrix=design.int, compare1 ="present", compare2="absent", level="10") contrast.10 @ Similarly to creating a contrast matrix at the 48 hour period, we will do the following: <<>>= contrast.48<-make.contrast(design.matrix=design.int, compare1 ="present", compare2="absent", level="48") contrast.48 @ \subsubsection{Analysis and Output Your Results} Next we can use the same \verb"regress", \verb"select.sig.gene", and \verb"result2html" function to complete the rest of the analysis for each time period. However, the wrapper function \verb"AffyRegress" will not work for this situation. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Multifactor Analysis II} A commmon interest of biologists is to identify how genes respond to estrogen treatments differently at different time points. In statistical jargon, this is a test for interaction. Interaction is a statistical term refering to a situation when the relationship between the outcome and the variable of the main interest differs at different levels of the extraneous variable. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Design Matrix and Contrast Matrix} Like before, we need to create the design and contrast matrix to detect the interaction effect. The design matrix is the same as the one we just created \verb"design.int". However, we will create a new contrast in order to test the interaction effect. <<>>= contrast.int<-make.contrast(design.int, interaction=TRUE) contrast.int @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Identify Genes with Interaction Effect} We will use the same \texttt{regress} function to detect which genes have the interaction effect. <<>>= result.int<-regress(object=filtered, design=design.int, contrast=contrast.int, method="L") @ Suppose we would like to select genes having the interaction effect based on a p value <0.05, and fold change >1.5. Note the fold change here means the difference of the fold change of estrogen duringthe 48 hour period and the fold change of estrogen during the 10 hour period. <<>>= select.int<-select.sig.gene(result.int, m.value=log2(1.5), p.value=0.05) @ There are 224 genes that are significant. That means among these 224 genes, the fold change from \texttt{absent} and \texttt{present} differ in the two time periods. <<>>= sig.ID<-select.int$ID[select.int$significant==TRUE] sig.index<-match(sig.ID, rownames(exprs(filtered))) @ The variable \texttt{sig.index} contains the column index of significant genes in the \mbox{\texttt{ExpressionSet}}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Analysis on Genes with the Interaction Effect} For this group of genes, we can use the \texttt{post.interaction} function to analyze how we can further explore how genes are expressed differently at different time points. Since \texttt{time.h} has two factors, the data type returned from this function is \texttt{list} with length equaling two. Each component of the list has the same formatted table returned from the \texttt{gene.select} function. <<>>= result2<-post.interaction(strata.var="time.h",compare1="present", compare2="absent", design.int=design.int, object=filtered[sig.index,], method="L",adj="fdr", p.value=0.05, m.value=log2(1.5)) @ Among these 224 genes, 89 are differently expressed at the 10 hour period and 156 are differently expressed at the 48 hour period. Next, we can output the differentially expressed genes to an HTML file by using the \texttt{interaction.result2html} function. This HTML file is similar to the one created by \texttt{result2html}. It contains the \texttt{Log2 ratio} and the \texttt{P value} for each time period. In the last column of this file, it contains the \texttt{P value} for the interaction effect. <<>>= interaction.result2html(cdf.name=annotation(filtered), result=result2, inter.result=result.int) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Analysis on Genes without the Interaction Effect} For genes without the interaction effect, which means that they respond to estrogen treatment similarly between the two time points, we can use the same design and contrast matrix from section 5.1.1 to detect estrogen effect. <<>>= result1<-regress(object=filtered[-sig.index,], design=design, contrast=contrast, method="L", adj="fdr") select<-select.sig.gene(top.table=result1, p.value=0.05, m.value=log2(1.5)) @ \subsubsection{A Wrapper Function} The entire process above can also be acomplished by a wrapper function, \texttt{AffyInteraction}. This wrapper function will create a design matrix and contrast matrix for the interaction test. Then it tests for an interaction effect for each gene and identifies genes whose interaction test is significant. For genes with interaction effect, they'll fit a linear model for each gene in each time period. For genes that don't have interaction effect, they'll fit a linear model for each gene without splitting the data by time period. It will output signficant results in the end. <<>>= result3<-AffyInteraction(object=filtered, method="L", main.var="estrogen", strata.var = "time.h", compare1="present", compare2="absent", p.int=0.05, m.int=log2(1.5), adj.int="none", p.value=0.05, m.value=log2(1.5), adj="fdr") @ \begin{thebibliography}{99} \bibitem{limma} Smyth, G. K. (2005). Limma: linear models for microarray data. \emph{Bioinfomatics and Computational biology Solutions using R and Bioconductor}, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York \end{thebibliography} \end{document}