% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{logitT primer} %\VignetteKeywords{DifferentialExpression, Affymetrix} %\VignetteDepends{logitT, affy, SpikeInSubset} %\VignettePackage{logitT} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{amsmath} \usepackage{amsthm} \usepackage{url} %\usepackage{multirow} %\usepackage{bm} %\usepackage{listings} \usepackage{verbatim} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in %\headheight=-.3in %\newcommand{\scscst}{\scriptscriptstyle} %\newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \author{Tobias Guennel} \begin{document} \title{Description of Logit-t: Detecting Differentially Expressed Genes Using Probe-Level Data} \maketitle \tableofcontents \newpage \section{Introduction} The \textsl{Logit-t} algorithm (Lemon \textsl{et al.}, 2003), also called \textsl{Median t} (Hess \textsl{et al.}, 2007), was developed as an alternative to existing statistical methods for identifying differentially expressed genes. Unlike other commonly used algorithms for identifying differentially expressed genes, for example SAM (Tusher \textsl{et al.}, 2001), the \textsl{Logit-t} algorithm does not require the calculation of an expression summary for each probe set prior to statistical analysis. Using spike-in datasets, the \textsl{Logit-t} algorithm was compared to statistical testing of probe set expression summaries MAS5 (Hubbell \textsl{et al.}, 2002), RMA (Irizarry \textsl{et al.},2003), and dChip (Li \textsl{et al.},2001) and was found to to have better sensitivity, specificity, and positive predicted value (PPV) (Lemon \textsl{et al.}, 2003).\\ The \textsl{Logit-t} algorithm proceeds by first normalizing probe intensities $i$ within probe set $j$ for array $k$ using the logit-log transformation \begin{equation} logit(y_{ijk})= log\left(\frac{y_{ijk}-N_{k}}{A_{k}-y_{ijk}}\right). \end{equation} The parameters $A_{k}$ and $N_{k}$ are estimated for each array by \begin{equation*}\begin{split} A_{k}&=\max_{i,j}(y_{ijk})+0.001*(\max_{i,j}(y_{ijk})-\min_{i,j}(y_{ijk}))\\ N_{k}&=\min_{i,j}(y_{ijk})-0.001*(\max_{i,j}(y_{ijk})-\min_{i,j}(y_{ijk})), \end{split}\end{equation*} representing the maximum probe intensity (saturation) and the minimum probe intensity (background), respectively. Following this transformation, probe level intensities for each array were then standardized using a Z-transformation. Lemon \textsl{et al.} (2003). \\ For differential expression analysis, Student's $t$-tests are then performed for each transformed PM probe within a probe set and the \textsl{Logit-t} is defined as the median $t$-statistic among all $t$-statistics for the probe set. The $t$-value cutoff corresponding to $p < 0.01$ using the df of the comparison was used as threshold for making detection calls of differential expression or no differential expression in the original paper (Lemon \textsl{et al.}, 2003).\\ The \textsl{Logit-t} algorithm was originally coded as a stand-alone application in C++ provided by Dr William J. Lemon. In order to make it available for a broader spectrum of users, the algorithm was implemented in the programming environment \textbf{R}, an open source programming environment (R Development Core Team, 2008) and the most commonly used software package for gene expression data analysis. There are no differences in results between the stand-alone application and its implementation in the \textbf{R} programming environment as most of the model fitting C++ code has been retained. However, the implementation into the \textbf{R} programming environment simplifies the use of the algorithm and the access to its results significantly compared to the stand-alone application. In order to make it available for a broader spectrum of users, the algorithm was implemented in \textbf{R}, an open source programming environment (R Development Core Team, 2008) and the most commonly used software package for gene expression data analysis. There are no differences in results between the stand-alone application and its implementation in the \textbf{R} programming environment as most of the model fitting C++ code has been retained. However, the implementation into the \textbf{R} programming environment simplifies the use of the algorithm and the access to its results significantly compared to the stand-alone application. The \Rpackage{logitT} package implements the Logit-t algorithm in the \textbf{R} programming environment, making it available to users of the Bioconductor\footnote{\url{http://www.bioconductor.org/}} project. \section{What's new in this version} This is the first release of this package. \section{Preparing data for use with logitT} The \textsl{logitT} package accepts data from Affymetrix CEL files that have been read into the \textbf{R} programming environment using the affy (Gautier \textsl{et al.}, 2004) library and stored in an \Robject{AffyBatch} object. A CEL file contains, among other information, a decimal number for each probe on the chip that corresponds to its intensity. The Bioconductor packages \Rpackage{affy}, \Rpackage{Biobase}, and \Rpackage{tools}(Gentleman \textsl{et al.}, 2004) are automatically loaded by \Rpackage{logitT} to provide functions for reading Affymetrix data files into \textbf{R}. The following steps create an \Robject{AffyBatch} object. \begin{enumerate} \item Create a directory containing all CEL files relevant to the planned analysis. \item If using Linux / Unix, start R in that directory. \item If using the Rgui for Microsoft Windows, make sure your working directory contains the CEL files (use ``File -> Change Dir'' menu item). \item Load the library. <>= library(logitT) @ \item Read in the data and create an \Robject{AffyBatch} object. \end{enumerate} Additional information regarding the \Rfunction{ReadAffy} function and detailed description of the structure of CEL files can be found in the \Rpackage{affy} vignette. \section{Generate t-statistics and probe set expression summaries} To demonstrate the use of the \Rpackage{logitT} \textbf{R} package, the publicly available Affymetrix Latin Square data set (Affymetrix, 2008) was used. The Latin Square design for this data set consists of 14 spiked-in gene groups in 14 experimental groups on HG-U95Av2 arrays. The concentration of the 14 gene groups in the first experiment is 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024pM. Each subsequent experiment rotates the spike-in concentrations by one group; i.e. experiment 2 begins with 0.25pM and ends at 0pM, on up to experiment 14, which begins with 1024pM and ends with 512pM. Each experiment contains at least 3 replicates. \\ Two of the 14 experimental groups, i.e. group one and group two with 3 replicates each, are available in the experimental data package \textsl{SpikeInSubset} as the data set \textsl{spikein95}. The package and subsequently the data set can be loaded into the \textbf{R} programming environment using \textsl{R$>$ library(SpikeInSubSet)} and \textsl{R$>$ data(spikein95)}. To compare the two groups of three arrays, where the first three arrays correspond to group 1 and the latter three to group 2, we define the groupvector as before. To invoke the \textsl{logitTAffy} method, the syntax is: <>= library(SpikeInSubset) data(spikein95) ## get the example data groupvector<-c("A","A","A","B","B","B") ## specify group affiliations logitTex<-logitTAffy(spikein95, group=groupvector) @ The first few $t$-statistics are: <<>>= logitTex[1:10] @ The input arguments for the function call are: \begin{description} \item[object] -- object of type \Robject{AffyBatch} \item[group] -- vector describing the group affiliation for each array \end{description} The vector describing the group affiliation for each array must be of the same length as the number of CEL files contained in the \textsl{AffyBatch} object. For example, suppose six CEL files are stored in the \textsl{AffyBatch} object where the first three arrays are from one condition and the last three arrays are from a second condition. The object \textsl{groupvector} in the function call would be <>= groupvector<-c("A","A","A","B","B","B") @ Note that the group labels can be defined as desired provided only one unique label is used for identifying a specific group.\\ After invoking the \textsl{logitTAffy} function, the result is a named vector containing the t-statistics for each probe set. \section{Using Logit-t in gene expression analysis} As described in the introduction, Student's $t$-tests are then performed for each transformed PM probe within a probe set and the \textsl{Logit-t} is defined as the median $t$-statistic among all $t$-statistics for the probe set. The $t$-value cutoff corresponding to $p < 0.01$ using the df of the comparison was used as threshold for making detection calls of differential expression or no differential expression in the original paper (Lemon \textsl{et al.}, 2003). Lemon et al. suggested to use the df of the comparison to obtain p-values for the calculated t-statistics. In this example, $n_{1}=3$ and $n_{2}=3$ and therefore df$=n_{1}+n_{2}-2=4$. Two-sided p-values for each probe set can be obtained using: <<>>= pvals <- (1-pt(abs(logitTex),df=4))*2 ## calculate p-values signifgenes<-names(logitTex)[pvals<0.01] ## find probe sets with p-values smaller than 0.01 signifgenes @ \section{Version history} \begin{description} \item[1.0.0] initial development version \end{description} \section{Acknowledgements} We thank Sandy Liyanarachchi for providing the original C++ implementation of the Logit-t algorithm as described by Dr William J. Lemon, Dr Ming You, and Dr Sandya Liyanarachchi. This work was partially supported by National Institute of Diabetes and Digestive and Kidney Diseases R01DK069859. \begin{thebibliography}{} \bibitem[Affymetrix, 2008]{Affy08} Affymetrix (2008) \href{http://www.affymetrix.com}{[http://www.affymetrix.com]}. \bibitem[Gautier {\it et~al}., 2004]{Gautier2004} Gautier,L. et al. (2004) Affy--analysis of Affymetrix GeneChip data at the probe level, {\it Bioinformatics}, {\bf 20(3)}, 307-315. \bibitem[Gentleman {\it et~al}., 2004]{Gentleman2004} Gentleman,R.C. et al. (2004) A high performance test of differential gene expression for oligonucleotide arrays, {\it Genome Biology},{\bf 4},R:67. \bibitem[Hess {\it et~al}., 2007]{Hess2007} Hess,A. et al. (2007) Fisher'scombined p-value for detecting differentially expressed genes using Affymetrix expression arrays, {\it BMC Genomics},{\bf 8}:96. \bibitem[Hubbell {\it et~al}., 2002]{Hubbell2002} Hubbell,E. et al. (2002) Analysis of high density expression microarrays with signed-rank call algorithms, {\it Bioinformatics},{\bf 18},1593-1599. \bibitem[Irizarry {\it et~al}., 2003]{Irizarry2003} Irizarry,R.A. et al. (2003) Summaries of Affymetrix GeneChip probe level data, {\it Nucleic Acid Res},{\bf 31},E:15. \bibitem[Lemon {\it et~al}., 2003]{Lemon2003} Lemon,W.J. et al. (2003) Bioconductor: open software development for computational biology and bioinformatics, {\it Genome Biology},{\bf 5},R:80. \bibitem[R Development Core Team, 2008]{R2008} R Development Core Team (2008), R: A Language and Environment for Statistical Computing Version 2.7.1. R Foundation for Statistical Computing, Vienna, Austria. %\bibitem[Tusher {\it et~al}., 2001]{Tusher2001} Tusher,V.G. et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response, {\it PNAS},{\bf 98},5116-5121. \end{thebibliography} \end{document}