%% \VignetteIndexEntry{Introduction to illuminaio} \documentclass{article} <>= BiocStyle::latex() @ \title{illuminaio} \author{M.L. Smith, K.D. Hansen} \begin{document} \maketitle \section*{Introduction} \Biocpkg{illuminaio} is designed to provide a single package containing routines for importing data from Illumina BeadArray platforms into \R{}. The intention is for \Biocpkg{illuminaio} to provide developers of downstream analysis packages with a mechanism for extracting all possible information from IDAT files in a relatively straight forward fashion. The choice of data to retain and how it should be stored is then left for the end user to decide. This vignette gives examples of how data files can be read and discusses the values that are returned, which will vary depending upon the BeadArray platform used. It also demonstrates that the values extracted by \Biocpkg{illuminaio} are the same as those returned when using Illumina's own GenomeStudio software. \section*{Citing illuminaio} If you use the package, please cite the associated paper \cite{illuminaio}. \section*{Reading Data} \subsection*{Expression Array} The code below gives an example of how an IDAT file (in this case from a Human expression array) can be read, and explores the information that is extracted. <>= library(illuminaio) library(IlluminaDataTestFiles) idatFile <- system.file("extdata", "idat", "4343238080_A_Grn.idat", package = "IlluminaDataTestFiles") idat <- readIDAT(idatFile) @ The first two lines above load libraries require for this vignette. Firstly this package and then \Biocpkg{IlluminaDataTestFiles}, a small data package containing the example files that are used throughout this vignette. The third line generates the complete path to one such file. We then use the function \Rfunction{readIDAT} to read the file. This function take only a single argument, the file's path. Although IDAT files are found in multiple formats, \Rfunction{readIDAT} is able to determine this and will call the appropriate reading routine internally. The returned object is a list, the contents of which is explored below. <>= names(idat) @ Using the \Rfunction{names} command above lists the data extracted from the file. In this case \texttt{Barcode} and \texttt{Section} are the identifiers for the BeadChip and can usually be found in the file name as well. \texttt{ChipType} describes the BeadArray platform this file was generated from. The \texttt{RunInfo} slot holds information about the processing performed on the array, most notably the date upon which it was scanned. Such information may be useful if one is attempting to identify batch effects amongst multiple samples. \texttt{Quants} is where the per-bead-type values are found. The commands below assign the \texttt{Quants} values to a new variable for convenience, and then print the first six entries from the resulting data.frame. <>= idatData <- idat$Quants head(idatData) @ We can see that for this expression array a total of 10 values are returned for each bead-type. The column headers are pulled directly from the IDAT file and do not directly match with how things are commonly labelled in GenomeStudio, although for the most part they are relatively easy to decipher. The \texttt{CodeBinData, MeanBinData} and \texttt{NumGoodBeadsBinData} columns are those that are reported by default in GenomeStudio and correspond the ProbeID, AVG\_Signal and NBEADS values respectively. \texttt{DevBinData} gives the standard deviation for the bead-type and can be used the generate the BEAD\_STDERR values GenomeStudio reports. The remaining columns give additional information and are discussed in the file \texttt{EncryptedFormat.pdf} that also accompanies this package. \subsection*{Genotyping Array} The example above focused on reading an IDAT file produced by scanning an expression array. To highlight some of the differences in output we shall now read a file from an Infinium genotyping array. \fixme{I should check exactly which platform this is from - MLS} <>= genotypeIdatFile <- system.file("extdata", "idat", "5723646052_R02C02_Grn.idat", package = "IlluminaDataTestFiles") genotypeIdat <- readIDAT(genotypeIdatFile) names(genotypeIdat) @ The reading of the file proceeds in exactly the same way as before and a list is again returned. However, there are several more data fields returned. Again \texttt{Quants} is where the per-bead-type values are stored. <>= head(genotypeIdat$Quants) @ For genotyping arrays only the four typically reported values are contained within the IDAT file and their column names more closely resemble those that are found in GenomeStudio. \section*{Comparison with GenomeStudio} Now we shall compare the values extracted by \Biocpkg{illuminaio} with those reported by Illumina's GenomeStudio software, to ensure our file reading routines are performing correctly. \subsection*{Importing GenomeStudio Values} <>= gsFile <- system.file("extdata", "gs", "4343238080_A_ProbeSummary.txt.gz", package = "IlluminaDataTestFiles") gStudio <- read.delim(gsFile, sep = "\t", header = TRUE) idatData <- idatData[which(idatData[,"CodesBinData"] %in% gStudio[,"ProbeID"]),] gStudio <- gStudio[match(idatData[,"CodesBinData"], gStudio[,"ProbeID"]),] @ The first line above reads a file that was produced by reading the IDAT file into GenomeStudio and then immediately exporting that data as a tab separated text file. No other processing was performed on the data. \fixme{Currently this file is stored on a web server, although the intention is to include it in the \Biocpkg{illuminaDataTestFile} package.} However, the two datasets are not quite compatible at the moment. Reading directly from an IDAT file returns values for several bead-types that serve as internal controls and are not annotated by Illumina. These bead-types are excluded automatically by GenomeStudio, so the second line above identifies and removes them from our \Biocpkg{illuminaio} data. The two data sets should now contain the same number of bead-types. The inclusion of these extra bead-types is not the only difference, they are also in different orders. Bead-types are extracted in numerical order from IDAT files, but the GenomeStudio output is sorted alphabetically. The third line reorders the GenomeStudio values to match those from \Biocpkg{illuminaio}, making our comparison slightly easier. \subsection*{Performing Comparison} The code below produces the two plots seen in Figure 1. <>= par(mfrow = c(1,2)) plot(idatData[, "MeanBinData"], gStudio[, "X4343238080_A.AVG_Signal"], xlab = "illuminaio", ylab = "GenomeStudio") identical(idatData[, "MeanBinData"], gStudio[, "X4343238080_A.AVG_Signal"]) hist(idatData[, "MeanBinData"]- gStudio[, "X4343238080_A.AVG_Signal"], breaks = 100, main = "", xlab = "Difference") @ \incfig{illuminaio-figureComparingValues}{\textwidth}{}{Comparing values obtained by \Biocpkg{illuminaio} and GenomeStudio} The first plot shows the summarized bead-intensity values extracted by \Biocpkg{illuminaio} on the horizontal axis against Genomestudio's values on the vertical axis. We can see they are highly similar. However, they are not identical, as shown by the third line above. The second plot visualises the distribution of the differences between the two sets of values, showing them to be small. These are most likely introduced by rounding performed by GenomeStudio that is not carried out by \Biocpkg{illuminaio}. \pagebreak %--------------------------------------------------------- \section*{Session info} %--------------------------------------------------------- Here is the output of \Rfunction{sessionInfo} on the system on which this document was compiled: <>= toLatex(sessionInfo()) @ \bibliography{illuminaio} \end{document}